The data used in this project are derived from gnomAD v2.1 exome data. Several computational operations have to be performed prior to executing this notebook. These steps include:

1. The GTEx_Analysis_2016-01-15_v7_RSEMv1.2.22_transcript_tpm.txt.gz file is parsed in shell to create the stats.tr.full.tsv file. This file contains expression data (mean and maximum expression of each transcript, and binary variables indicating whether a transcript is expressed at the mean or maximum level across isoforms).
2. The gnomad.exomes.r2.1.1.sites.vcf.gz file is used along with stats.tr.full.tsv as input to the run_ptv2.py script that extracts PTV variant sites according to the criteria of Cassa et al., 2017. The resulting TSV file contains information about all PTVs in canonical transcripts, annotated with GTEx-based expression variables.
3. The cds_ccr_intersection.bed file contains CDS coordinates, gene names, and CCR percentiles. The parse_ccr.py script is used to average CCR data over exons and to compute a Z-score and the % of maximum constraint within a gene for each exon. The resulting exones_ccrs.bed file is then intersected with a BED file generated from the main PTV table (+/- 3 bp up- and downstream of each PTV), and the resulting variants_ccrs_intersect.bed file is used as input to the add_ccr.py script to annotate variants with CCR values.
4. The lof_predicted.tsv file is used as input to the add_aloft.py script to annotate the TSV file with ALoFT predictions.
5. The add_constraint.py script is used to annotate variants with gnomAD-derived constraint metrics. This script requires a VCF file with all extracted PTV sites annotated with the VEP field from the original gnomAD VCF file. The gnomad.v2.1.1.lof_metrics.by_transcript.txt file is used to get gnomAD-derived constraint measures (LOEUF, pLI, and the o/e ratio for each transcript). During annotation, the script adds information about the most constrained isoform affected by a PTV (max_pLI, min_LOEUF), as well as information about potential “rescue” transcripts (i.e., ones not affected by a PTV), to the TSV data table.
6. The all.baselevel.021620.tsv.bgz file is parsed to create the pext_avg.tsv table containing average GTEx-based pext scores. This table is then used as input to the add_avg_pext.py script that annotates variants with these scores.

The resulting TSV file with PTV information (full_data_constr_pext.tsv) is used as input in the first chunk of this Rmd file.
#library(STAR)
library(ggplot2)
#library(MCMCpack)
library(reshape2)
library(cowplot)
library(colorRamps)
library(Hmisc)
library(pROC)
library(randomForest)
mypal = matlab.like2(10000)
setwd("/media/barbitoff/DATA/Working issues/BI_Science/Natural Selection/Remastered")
set.seed(39)
Here, we start by aggregating summary information about the total PTV allele count (AC) within each gene. Allele numbers (AN) are averaged across all variants in a gene for each population to create a gene-level AN annotation. Before aggregation, genes covered at less than 30x across gnomAD samples are removed. For genes without PTVs in canonical transcripts, the exome-wide average AN is used for each population.
raw_data = read.table('full_data_constr_pext.tsv', sep='\t', header=T)
cov_data = read.table('./covered_genes.tsv', header=T, sep='\t')
raw_data = raw_data[as.character(raw_data$gene) %in% as.character(cov_data$name), ]
raw_data$gene = droplevels(raw_data$gene)
ac_cols = grepl('AC_', colnames(raw_data))
an_cols = grepl('AN_', colnames(raw_data))
gene_data = data.frame(gene = sort(unique(raw_data$gene)))
for (i in colnames(raw_data)[ac_cols]){
gene_data[, i] = as.numeric(by(raw_data[, i], raw_data$gene, sum))
}
gene_data$AC_main = gene_data$AC_nfe + gene_data$AC_afr + gene_data$AC_amr +
gene_data$AC_eas + gene_data$AC_sas
for (i in colnames(raw_data)[an_cols]){
gene_data[, i] = as.numeric(by(raw_data[, i], raw_data$gene, mean))
}
gene_data$AN_main = gene_data$AN_nfe + gene_data$AN_afr + gene_data$AN_amr +
gene_data$AN_eas + gene_data$AN_sas
# Add covered genes with no pLoF
dummy_row = c(rep(0, 8), colMeans(gene_data[, 10:17]))  # zero ACs, exome-average ANs
zero_genes = unique(as.character(cov_data$name)[!(as.character(cov_data$name)
%in% as.character(gene_data$gene))])
remainder = as.data.frame(t(sapply(zero_genes, function(x) c(x, dummy_row))))
rownames(remainder) = c()
colnames(remainder) = colnames(gene_data)
final_gene_data = as.data.frame(rbind(gene_data, remainder))
num_cols = colnames(final_gene_data)[grepl('A[CN]_', colnames(final_gene_data))]
for (col in num_cols){
final_gene_data[, col] = as.numeric(as.character(final_gene_data[, col]))
}
final_gene_data$gene = as.factor(as.character(final_gene_data$gene))
final_gene_data =
final_gene_data[final_gene_data$AC_main/final_gene_data$AN_main <= 0.001, ]
# Cassa comparison
cassa = read.table('cassa_table.csv', header=T, sep='\t')
rownames(cassa) = cassa$gene_symbol
final_gene_data$U = sapply(final_gene_data$gene,
function(x) cassa[as.character(x), 'U'])
final_gene_data = final_gene_data[!(is.na(final_gene_data$U)), ]
final_gene_data[is.na(final_gene_data)] = 1
perc = ecdf(final_gene_data$U)
final_gene_data$tercile = floor(perc(final_gene_data$U) * 3)
final_gene_data$tercile = ifelse(final_gene_data$tercile < 3,
final_gene_data$tercile + 1,
final_gene_data$tercile)
table(final_gene_data$tercile)
##
## 1 2 3
## 5270 5256 5296
final_gene_data$shet_cassa = sapply(final_gene_data$gene,
function(x) cassa[as.character(x), 's_het'])
table(is.na(final_gene_data$shet_cassa))
##
## FALSE
## 15822
write.table(final_gene_data, file='full_aggregated.tsv', sep='\t',
row.names=F, quote=F)
Here, we investigate the gene-level AC and AN values across populations. As can be seen from the plots, the distributions of per-gene allele counts are similar in shape, while the median value in each population is proportional to the sample size (AN), with the exception of the Finnish and Ashkenazi Jewish founder populations.
final_gene_data = read.table('full_aggregated.tsv', header=T, sep='\t')
tpl = melt(final_gene_data[, 2:8])
ggplot(tpl, aes(x=variable, y=value, fill=variable)) + geom_violin(adjust=5) +
geom_boxplot(width = 0.075, outlier.shape = NA, fill='white') +
scale_y_continuous(limits=c(0, 25)) + theme_bw()
ans = data.frame(AN = colMeans(final_gene_data[, 10:16]),
pops = colnames(final_gene_data)[10:16])
ggplot(ans, aes(x=pops, y=AN)) + geom_bar(stat='identity')
We next proceed to evaluate the uniformity of the PTV count (AC) distribution across populations. To this end, we use a \(\chi^2\) test: in each case, we combine the AC and AN vectors into a contingency table. We calculate the \(\chi^2\) p-value for all populations, for the five major non-founder ones (AFR, AMR, EAS, NFE, SAS - “no_fin_asj”), and for all possible combinations of four populations out of “no_fin_asj”. The latter comparison is performed to assess the extent to which each population contributes to the observed non-uniformity.
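As a toy illustration of this contingency-table construction (hypothetical counts for three populations, not part of the real dataset):

```r
# Hypothetical AC (PTV allele counts) and AN (mean allele numbers)
# for three populations. When AC is proportional to AN, the chi-squared
# p-value is large; a skewed AC vector yields a small p-value.
ac_uniform <- c(10, 5, 20)
ac_skewed <- c(30, 5, 0)
an <- c(10000, 5000, 20000)
p_uniform <- chisq.test(matrix(c(ac_uniform, an), byrow = TRUE, nrow = 2))$p.value
p_skewed <- chisq.test(matrix(c(ac_skewed, an), byrow = TRUE, nrow = 2))$p.value
c(p_uniform, p_skewed)
```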
rownames(final_gene_data) = final_gene_data$gene
inheritance = read.table('genes_inheritance.txt', header=F, sep='\t',
stringsAsFactors = F, row.names=1)
final_gene_data$disease = sapply(rownames(final_gene_data),
function(x) ifelse(x%in% rownames(inheritance),
inheritance[x, 'V2'], 'none'))
final_gene_data$dbin = sapply(final_gene_data$disease,
function(x) ifelse(x == "none", 0, 1))
final_gene_data$whole = apply(final_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:8]),
as.numeric(elt[10:16])),
byrow=T, nrow=2))$p.value)
final_gene_data$no_fin_asj = apply(final_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6]),
as.numeric(elt[10:14])),
byrow=T, nrow=2))$p.value)
for (i in 2:6) {
final_gene_data[, ncol(final_gene_data) + 1] = apply(final_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6][c(2:6) != i]),
as.numeric(elt[10:14][c(10:14) != 8+i])),
byrow=T, nrow=2))$p.value)
}
colnames(final_gene_data) = c(colnames(final_gene_data)[1:24], 'no_afr',
'no_sas', 'no_amr', 'no_eas', 'no_nfe')
tpl = melt(final_gene_data[, c(1, 21:29)], id.vars=c('gene', 'disease', 'dbin'))
tpl = na.omit(tpl[tpl$value < 1e-5, ])
sum(tpl$variable == 'whole')
## [1] 2682
sum(tpl$variable == 'whole' & tpl$dbin == 1)
## [1] 514
sum(tpl$variable == 'no_fin_asj')
## [1] 1948
sum(tpl$variable == 'no_fin_asj' & tpl$dbin == 1)
## [1] 330
ggplot(tpl, aes(x=variable, fill=as.factor(dbin))) + geom_bar() +
theme_bw() + coord_flip()
write.table(final_gene_data, file='full_out.tsv', sep='\t',
row.names=F, quote=F)
On the plots, we can see that a substantial proportion of genes show significant non-uniformity of PTV allele counts. Moreover, a large fraction of these genes are linked to inherited disorders (as indicated in the OMIM catalog). Excluding any one of the five major populations from the analysis decreases the number of significantly non-uniform genes by a similar fraction.
The full_out.tsv file written at the previous stage is used as input to the IG_params.py script that optimizes the parameters of the prior inverse Gaussian distribution of \(s_{het}\) values using the trust-region constrained (“trust-constr”) optimization method. Here we compare the resulting composite IG distribution estimated from ExAC by Cassa et al., 2017, with the distribution produced by our optimization.
dIG = function(x, alpha, beta) {
fc = (beta/(2*pi*(x^3)))^(1/2)
sc = exp(-((beta * (x - alpha)^2)/(2*alpha^2*x)))
return(fc * sc)
}
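As a quick sanity check of this parameterization (here \(\alpha\) is the mean and \(\beta\) the shape parameter of the inverse Gaussian), the density should integrate to one and its mean should equal \(\alpha\). A numerical check with one of the ExAC components:

```r
# Numerical sanity check: dIG (defined above) should integrate to ~1
# over (0, Inf), and the mean of the distribution should be ~alpha.
f <- function(x) dIG(x, alpha = 0.288, beta = 0.0091)
total_mass <- integrate(f, 0, Inf, subdivisions = 1000)$value
mean_val <- integrate(function(x) x * f(x), 0, Inf, subdivisions = 1000)$value
c(total_mass, mean_val)  # expected to be close to 1 and 0.288
```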
p1 = sapply(seq(0.0001, 1, by=0.0001),
function(x) dIG(x, beta=0.0091, alpha=0.288))
p2 = sapply(seq(0.0001, 1, by=0.0001),
function(x) dIG(x, beta=0.0296, alpha=0.945))
p3 = sapply(seq(0.0001, 1, by=0.0001),
function(x) dIG(x, beta=0.0176, alpha=0.249))
#[(array([0.91184562, 0.00710335]), 10478.644733658062),
#(array([0.91368573, 0.00788945]), 16651.85653419709),
#(array([0.93319117, 0.01253176]), 15636.08337229499)]
p_g1 = sapply(seq(0.0001, 1, by=0.0001),
function(x) dIG(x, beta=0.0103, alpha=0.089))
p_g2 = sapply(seq(0.0001, 1, by=0.0001),
function(x) dIG(x, beta=0.0104, alpha=0.908))
p_g3 = sapply(seq(0.0001, 1, by=0.0001),
function(x) dIG(x, beta=0.0193, alpha=0.079))
dist = data.frame(s = seq(0.0001, 1, 0.0001), ExAC = (p1 + p2 + p3)/3,
gnomAD = (p_g1 + p_g2 + p_g3)/3)
tpl = melt(dist, id.vars='s')
ggplot(tpl, aes(x=s, y=value, col=variable, fill=variable)) + geom_line(lwd=1) + scale_x_log10() +
theme_bw() + ylab('Probability density') + xlab('Selection coefficient')# + facet_wrap(~variable, nrow=5)
Here we can see that the gnomAD-derived prior has a slightly lower mean and less pronounced tails of the distribution (lower variance).
The inferred \(\alpha\) and \(\beta\) parameters of the inverse Gaussian distribution are then used in the estimate_S.py script to find the best-fitting \(s_{het}\) value for each gene. The resulting dataframe (full_corr_wshet.tsv) is loaded below.
shet_gn_data = read.table('./full_corr_wshet.tsv', sep=',', header=T)
# Filtering out lowest values
# shet_gn_data = shet_gn_data[shet_gn_data$s_main > 0.0001, ]
head(shet_gn_data)
## gene AC_afr AC_sas AC_amr AC_eas AC_nfe AC_fin AC_asj AC_main AN_afr
## 1 A1BG 2 4 9 3 8 0 0 26 13535.08
## 2 A1CF 3 4 0 3 16 1 0 26 16178.32
## 3 A2M 7 6 12 2 36 1 1 63 15299.74
## 4 A3GALT2 101 7 16 0 9 0 0 133 10777.08
## 5 A4GALT 4 1 4 4 1 0 0 14 16133.50
## 6 AAAS 13 10 9 2 50 1 0 84 16203.04
## AN_sas AN_amr AN_eas AN_nfe AN_fin AN_asj AN_main U
## 1 27767.69 31261.85 16013.23 94906.00 18892.92 9129.385 183483.8 1.00e-06
## 2 29613.58 33776.21 18215.05 112025.26 21390.11 9798.842 209808.4 1.90e-06
## 3 29853.16 33740.52 17773.35 110526.77 21338.84 9914.258 207193.5 3.98e-06
## 4 25616.15 27697.54 13308.15 84884.62 16312.77 8251.077 162283.5 7.35e-07
## 5 30609.50 34567.75 18360.25 112987.75 21544.50 10041.500 212658.8 4.22e-07
## 6 30601.44 34575.84 18384.88 113552.80 21620.72 10065.920 213318.0 1.97e-06
## tercile shet_cassa disease dbin whole no_fin_asj no_afr
## 1 2 0.0067941873 none 0 5.714157e-02 1.276412e-01 6.633933e-02
## 2 3 0.0223128341 none 0 2.371968e-01 2.600837e-01 1.757584e-01
## 3 3 0.0131856312 AD 1 8.213802e-02 3.044170e-01 2.948463e-01
## 4 2 0.0009017516 none 0 9.546033e-257 3.387014e-221 1.321018e-05
## 5 1 0.0097130065 none 0 1.272265e-04 2.041963e-04 6.948237e-04
## 6 3 0.0044380677 AR 1 2.822639e-04 1.037874e-02 9.782535e-02
## no_sas no_amr no_eas no_nfe a b
## 1 6.636039e-02 5.984579e-01 6.810947e-02 6.137303e-01 0.9447102 0.029639015
## 2 1.497725e-01 9.706194e-01 1.597203e-01 1.354497e-01 0.2493024 0.017659407
## 3 3.308379e-01 2.003532e-01 5.024637e-01 1.817065e-01 0.2493024 0.017659407
## 4 1.209703e-195 2.882767e-209 7.016287e-202 1.516873e-101 0.9447102 0.029639015
## 5 1.964107e-04 2.600499e-05 2.649756e-04 1.657890e-01 0.2881401 0.009142775
## 6 6.046026e-03 1.391559e-02 3.732420e-02 3.347410e-03 0.2493024 0.017659407
## s_main
## 1 0.0072
## 2 0.0148
## 3 0.0129
## 4 0.0010
## 5 0.0061
## 6 0.0050
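These fitted values can be cross-checked against the classical mutation-selection balance approximation (taken here as a sketch of the model's logic), under which the expected PTV allele count in a gene is roughly \(AN \cdot U / s_{het}\). Using the A1CF row from the table above:

```r
# Back-of-the-envelope check: expected AC ~ AN_main * U / s_het.
# Values for A1CF taken from the table above.
U <- 1.90e-06
AN_main <- 209808.4
s_main <- 0.0148
expected_AC <- AN_main * U / s_main
round(expected_AC)  # close to the observed AC_main of 26
```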
ggplot(shet_gn_data, aes(x=shet_cassa, y=s_main)) +
geom_hex(aes(fill=log10(..count..))) +
scale_x_log10(limits=c(0.0001, 1)) +
scale_y_log10(limits=c(0.0001, 1)) +
scale_fill_gradientn(colours = mypal) +
xlab('Selection coefficient (Cassa et al., 2017)') +
ylab('Selection coefficient (gnomAD)')
shet_gn_data$log_s_cassa = -log10(shet_gn_data$shet_cassa)
shet_gn_data$log_s = -log10(shet_gn_data$s_main)
fit_1 <- lm(log_s ~ log_s_cassa, shet_gn_data)
summary(fit_1)
##
## Call:
## lm(formula = log_s ~ log_s_cassa, data = shet_gn_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7447 -0.1532 0.0505 0.2086 2.1945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.745664 0.007583 98.33 <2e-16 ***
## log_s_cassa 0.580487 0.004223 137.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3479 on 15820 degrees of freedom
## Multiple R-squared: 0.5442, Adjusted R-squared: 0.5442
## F-statistic: 1.889e+04 on 1 and 15820 DF, p-value: < 2.2e-16
# Correlation = 0.740, CI = [0.733, 0.747]
cor(shet_gn_data$log_s, shet_gn_data$log_s_cassa)
## [1] 0.737732
cor.test(shet_gn_data$log_s, shet_gn_data$log_s_cassa)
##
## Pearson's product-moment correlation
##
## data: shet_gn_data$log_s and shet_gn_data$log_s_cassa
## t = 137.45, df = 15820, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7305479 0.7447528
## sample estimates:
## cor
## 0.737732
s_comp = melt(shet_gn_data, id.vars='gene',
measure.vars = c('s_main', 'shet_cassa'))
s_comp$variable = ifelse(s_comp$variable == 's_main',
'gnomAD', 'ExAC')
a <- ggplot(s_comp, aes(x=value, fill=variable)) +
geom_histogram(col='black') +
theme_bw() + facet_wrap(~variable, nrow=2) +
scale_x_log10() +
xlab('Selection coefficient') + ylab('Count') +
guides(fill=F)
b <- ggplot(s_comp, aes(x=variable, y=value, fill=variable)) +
geom_boxplot(col='black', outlier.shape = NA) +
theme_bw() + scale_y_log10() +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
ylab('Selection coefficient') + xlab('Dataset') +
guides(fill=F)
plot_grid(a, b, ncol=2, rel_widths = c(1, 0.45),
labels = c('a', 'b'))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
aggregate(value~variable, s_comp, median)
## variable value
## 1 ExAC 0.01899661
## 2 gnomAD 0.01900000
aggregate(value~variable, s_comp, IQR)
## variable value
## 1 ExAC 0.05864579
## 2 gnomAD 0.03780000
Next, we use the full_corr_wshet.tsv file as input to the CPL.py script that estimates the likelihood of the PTV count distribution model and compares this observed likelihood (cross-population likelihood, CPL) to an empirically estimated distribution of the same likelihood scores in 100 random samples. The results of this analysis are written to the full_out_boot_stats.csv file that is used in this chunk.
In the following piece of code, we load the data and explore the relationship between the Z-score of the PTV count distribution likelihood and the \(\chi^2\) p-value. We also analyze the goodness of fit of the model, and explore whether the genes with significant non-uniformity of PTV counts (termed “violator” genes for the rest of the analysis) are present among disease-relevant genes.
cpl_head = read.table('full_out_boot_stats.csv', row.names=1, header=T, sep=',')
cpl_head = cpl_head[!(is.na(cpl_head$L) | is.na(cpl_head$L_replicate)), ]
cpl_head = cpl_head[cpl_head$s_main > 0.0001 & cpl_head$L != Inf, ]
L_comp = melt(cpl_head, id.vars='gene', measure.vars = c('L', 'L_replicate'))
ggplot(L_comp, aes(x=variable, y=value, fill=variable)) +
geom_violin(col='black') +
theme_bw()# + facet_wrap(~variable, nrow=2)
# Observed vs expected
# ggplot(cpl_head, aes(x=L, y=L_replicate)) + geom_hex(aes(fill=log10(..count..))) +
# scale_y_continuous(limits=c(-5, 280)) + theme_bw() +
# xlab("Observed likelihood") + ylab("Expected likelihood") +
# scale_fill_gradientn(colours = mypal)
#
# ggplot(cpl_head, aes(x=as.numeric(L))) + geom_histogram(col='black', fill='red') +
# theme_bw()
# Q-Q
quant = data.frame(observed = sort(cpl_head$L),
expected = sort(cpl_head$L_replicate))
all <- ggplot(quant, aes(x=expected, y=observed)) + geom_point() +
theme_bw() + scale_y_continuous(limits=c(0, 160)) + scale_x_continuous(limits=c(0, 20)) +
xlab('Expected likelihood') + ylab('Observed likelihood') +
geom_abline(slope=1, col='red', lwd=1)
# Q-Q in large shet
quant = data.frame(observed = sort(cpl_head[cpl_head$s_main > 0.02, ]$L),
expected = sort(cpl_head[cpl_head$s_main > 0.02, ]$L_replicate))
high_cons <- ggplot(quant, aes(x=expected, y=observed)) + geom_point() +
theme_bw() +
xlab('Expected likelihood') + ylab('Observed likelihood') +
geom_abline(slope=1, col='red', lwd=1)
plot_grid(all, high_cons, nrow=1)
# CPL vs chisq
rownames(cpl_head) = as.character(cpl_head$gene)
cpl_head$log_chi_p = -log10(cpl_head$no_fin_asj)
#cpl_head = na.omit(cpl_head)
cpl_head = cpl_head[cpl_head$log_chi_p != Inf | is.na(cpl_head$log_chi_p), ]
inheritance = read.table('genes_inheritance.txt', header=F, sep='\t',
stringsAsFactors = F, row.names=1)
cpl_head$disease = sapply(as.character(cpl_head$gene),
function(x) ifelse(x%in% rownames(inheritance),
inheritance[x, 'V2'], 'none'))
cpl_head$dbin = sapply(cpl_head$disease,
function(x) ifelse(x == "none", 0, 1))
ggplot(cpl_head, aes(log_chi_p, L)) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() +
xlab('Chi-squared p-value (-log10)') + ylab('Observed likelihood') +
scale_fill_gradientn(colours = mypal)
#OLS (z-score transform)
cpl_head$OLS = abs(cpl_head$L_mean - cpl_head$L)/cpl_head$L_sd
#ggplot(cpl_head, aes(x=OLS)) + geom_histogram(fill='red', col='black') + theme_bw()
ggplot(cpl_head, aes(log_chi_p, OLS)) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() +
xlab('Chi-squared p-value (-log10)') + ylab('Likelihood Z-score (OLS)') +
scale_fill_gradientn(colours = mypal)
cor(as.numeric(cpl_head$OLS), as.numeric(cpl_head$log_chi_p))
## [1] NA
# CPL vs selection
ggplot(cpl_head, aes(x=s_main, y=OLS)) + geom_point() + scale_x_log10() +
geom_hline(yintercept = 14, col='red', lwd=1) + theme_bw()
# ggplot(cpl_head, aes(x=s_main, y=OLS, col=disease)) +
# geom_point() + scale_x_log10() +
# geom_hline(yintercept = 6, col='red', lwd=1) + theme_bw() +
# facet_wrap(~disease, nrow=3) + guides(col=F) + xlab('Selection coefficient') +
# ylab('Observed likelihod')
# OLS vs Shet, aspect ratio 5:4.5
ggplot(cpl_head, aes(x=s_main, y=OLS)) +
geom_hex(aes(fill=log10(..count..))) + scale_x_log10() +
scale_fill_gradientn(colours=mypal) +
theme_bw() +
geom_vline(xintercept = 0.006, col='red', lwd=1, lty=2) +
facet_wrap(~disease, nrow=3) + guides(col=F) + xlab('Selection coefficient') +
ylab('Likelihood Z-score (OLS)')
summary(glm(dbin ~ OLS, cpl_head, family="binomial"))
##
## Call:
## glm(formula = dbin ~ OLS, family = "binomial", data = cpl_head)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6731 -0.6717 -0.6697 -0.6462 2.2525
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.369335 0.021494 -63.708 < 2e-16 ***
## OLS -0.009109 0.002930 -3.109 0.00188 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15637 on 15685 degrees of freedom
## Residual deviance: 15626 on 15684 degrees of freedom
## AIC: 15630
##
## Number of Fisher Scoring iterations: 4
summary(glm(dbin ~ s_main + OLS, cpl_head, family="binomial"))
##
## Call:
## glm(formula = dbin ~ s_main + OLS, family = "binomial", data = cpl_head)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0843 -0.6684 -0.6556 -0.6339 2.1698
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.443070 0.026487 -54.482 < 2e-16 ***
## s_main 1.706462 0.348657 4.894 9.86e-07 ***
## OLS -0.006848 0.002895 -2.366 0.018 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15637 on 15685 degrees of freedom
## Residual deviance: 15603 on 15683 degrees of freedom
## AIC: 15609
##
## Number of Fisher Scoring iterations: 4
# Caret training
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
cpl_head$d_class = as.factor(ifelse(cpl_head$dbin == 1,
'implicated',
'non_implicated'))
train_control <- trainControl(method="cv", number=5, classProbs=T,
summaryFunction=twoClassSummary)
model_s = train(d_class~s_main, data=cpl_head, trControl=train_control,
method="rpart", na.action = na.pass)
model_z = train(d_class~OLS, data=cpl_head, trControl=train_control,
method="rpart", na.action = na.pass)
model_s_z = train(d_class~s_main + OLS, data=cpl_head, trControl=train_control,
method="rpart", na.action = na.pass)
print(model_s)
## CART
##
## 15686 samples
## 1 predictor
## 2 classes: 'implicated', 'non_implicated'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 12549, 12549, 12548, 12549, 12549
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 7.553583e-05 0.5507940 0.017656501 0.9834540
## 8.025682e-05 0.5514132 0.016693419 0.9844084
## 2.174701e-04 0.5503480 0.008988764 0.9894996
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 8.025682e-05.
print(model_s_z)
## CART
##
## 15686 samples
## 2 predictor
## 2 classes: 'implicated', 'non_implicated'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 12549, 12548, 12549, 12549, 12549
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.0004815409 0.5646896 0.017977528 0.9928405
## 0.0005350455 0.5471842 0.013804173 0.9956249
## 0.0005961935 0.5357769 0.009309791 0.9965795
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.0004815409.
print(model_z)
## CART
##
## 15686 samples
## 1 predictor
## 2 classes: 'implicated', 'non_implicated'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 12548, 12549, 12549, 12549, 12549
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.0002568218 0.5172177 0.0134831461 0.9886251
## 0.0003210273 0.5091709 0.0022471910 0.9973747
## 0.0004414125 0.5033555 0.0006420546 0.9998409
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.0002568218.
write.table(cpl_head, file='final_full_gene_stats.tsv',
sep='\t', row.names=F, quote=F)
One of the major sources of non-uniformity of PTV counts in highly conserved genes might be the presence of misannotated pLoF alleles. This might, at least partially, explain deviations in the relationships between gene-level constraint metrics (for example, metrics based on variant site counts, as opposed to allele counts, might show higher conservation for “violator” genes). We check this assumption by explicitly comparing the relationships between \(s_{het}\) and pLI (probability of loss-of-function intolerance) or LOEUF (upper bound of the pLoF observed/expected ratio).
gene_cons = read.table('gnomad.v2.1.1.lof_metrics.by_gene.txt', header=T, sep='\t')
cpl_head$pLI = as.numeric(sapply(as.character(cpl_head$gene), function(x)
as.numeric(gene_cons[gene_cons$gene == as.character(x), 'pLI'][1])))
cpl_head$loeuf = as.numeric(sapply(as.character(cpl_head$gene), function(x)
as.numeric(gene_cons[gene_cons$gene == as.character(x), 'oe_lof_upper'][1])))
cpl_head$loeuf = ifelse(cpl_head$loeuf > 1, 1, cpl_head$loeuf)
cpl_head$violator = ifelse(cpl_head$OLS > 5, 'yes', 'no')
#cpl_head = read.table('comprehensive_gene_annotation.tsv', sep='\t', header=T)
write.table(cpl_head, 'comprehensive_gene_annotation.tsv',
sep='\t', row.names=F, quote=F)
After pre-processing, let's first examine the relationship between the pLI and LOEUF measures provided by gnomAD, and then the relationship between these measures and \(s_{het}\) for “violator” and “non-violator” genes.
ggplot(cpl_head, aes(x=loeuf, y=pLI)) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() + scale_fill_gradientn(colours = mypal)
loeuf <- ggplot(cpl_head, aes(x=s_main, y=loeuf)) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() + scale_x_log10() +
scale_fill_gradientn(colours = mypal) +
facet_wrap(~violator, nrow=2) + guides(fill=F)
pli <- ggplot(cpl_head, aes(x=s_main, y=pLI)) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() + scale_x_log10() +
scale_fill_gradientn(colours = mypal) +
facet_wrap(~violator, nrow=2) + guides(fill=F)
plot_grid(pli, loeuf)
The latter plots suggest a logistic relationship, so let's fit a logistic model and then explore the fits for “violator” and “non-violator” genes.
loeuf_s <- ggplot(cpl_head, aes(x=s_main, y=loeuf, col=violator)) +
geom_smooth(lwd=1, method="glm", fullrange=T,
method.args = list(family = "binomial")) +
theme_bw() + scale_x_log10() +
xlab('Selection coefficient') +
ylab('LOEUF')
pLI_s <- ggplot(cpl_head, aes(x=s_main, y=pLI, col=violator)) +
geom_smooth(lwd=1, method="glm", fullrange=T,
method.args = list(family = "binomial")) +
theme_bw() + scale_x_log10() +
xlab('Selection coefficient') +
ylab('pLI')
plot_grid(loeuf_s, pLI_s, nrow=2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
As can be seen from the plots in this section, the relationship between \(s_{het}\) and pLI/LOEUF is indeed altered in “violator” genes, suggesting that these genes harbor a proportion of misannotated pLoF alleles in canonical transcripts.
Here, we re-load the variant dataset and append several annotations from the gene-level dataframe used in the previous steps.
var_data = read.table('full_data_constr_pext.tsv', sep='\t', header=T)
var_data$OLS = sapply(as.character(var_data$gene),
function(x) ifelse(x %in% as.character(cpl_head$gene),
as.numeric(cpl_head[x, 'OLS']), NA))
var_data = var_data[!is.na(var_data$OLS), ]
var_data$OLS = as.numeric(var_data$OLS)
var_data$s_main = sapply(as.character(var_data$gene),
function(x) ifelse(x %in% as.character(cpl_head$gene),
as.numeric(cpl_head[x, 's_main']), NA))
var_data$AC_main = var_data$AC_afr + var_data$AC_amr +
var_data$AC_eas + var_data$AC_sas + var_data$AC_nfe
var_data$violator = ifelse(var_data$OLS > 6, 'yes', 'no')
var_data$splice = grepl('splice', var_data$conseq)
Most of the comparisons in the next subsections are carried out on a subset of the data (only genes with \(s_{het} > 0.02\)), as genetic drift might contribute substantially to PTV count non-uniformity in genes under weaker constraint (please refer to one of the last sections of this document for supporting evidence at the level of functional variant properties).
var_data = var_data[var_data$s_main > 0.02, ]
violators = var_data[var_data$violator == 'yes', ]
non_violators = var_data[var_data$violator == 'no', ]
Let's first look at the CCR percentile values not normalized for average gene-level constraint. We are looking at the characteristics of individual variants within the respective gene group (“violator” or “non-violator”):
ggplot(var_data, aes(x=ccr.exon_pct, y=OLS)) +
geom_hex(aes(fill=log10(..count..))) +
scale_fill_gradientn(colours=mypal) +
theme_bw()
ggplot(var_data, aes(x=ccr.exon_pct, fill=violator)) +
geom_density(alpha=0.5) + theme_bw()
ggplot(var_data, aes(x=violator, y=ccr.exon_pct, fill=violator)) +
geom_violin(alpha=0.5) + theme_bw() +
geom_boxplot(width=0.4, col='black', lwd=0.5, fill='white',
outlier.shape=NA)
wilcox.test(violators$ccr.exon_pct, non_violators$ccr.exon_pct)
##
## Wilcoxon rank sum test with continuity correction
##
## data: violators$ccr.exon_pct and non_violators$ccr.exon_pct
## W = 5024002, p-value = 0.006462
## alternative hypothesis: true location shift is not equal to 0
aggregate(ccr.exon_pct ~ violator, var_data, median)
## violator ccr.exon_pct
## 1 no 43.47114
## 2 yes 44.53617
aggregate(ccr.exon_pct ~ violator, var_data, IQR)
## violator ccr.exon_pct
## 1 no 19.10943
## 2 yes 18.30978
We can see that certain differences exist; at the same time, these differences are subtle and explain a very low proportion of the variance. Hence, let's consider the exon-level constraint normalized to the mean or maximum constraint of the exons within a gene (the difference between the exon constraint and the mean exon constraint within a gene, and the exon-level constraint as a % of the maximum value, respectively):
# CCR pct gene-based Z
ggplot(var_data, aes(x=ccr_exon_z, y=OLS)) +
geom_hex(aes(fill=log10(..count..))) +
scale_fill_gradientn(colours=mypal) +
theme_bw()
ggplot(var_data, aes(x=ccr_exon_z, fill=violator)) +
geom_density(alpha=0.5) + theme_bw()
ggplot(var_data, aes(x=ccr_exon_frac, y=OLS)) +
geom_hex(aes(fill=log10(..count..))) +
scale_fill_gradientn(colours=mypal) +
theme_bw() +
scale_y_continuous(limits=c(0, 15))
ggplot(var_data, aes(x=ccr_exon_frac, fill=violator)) +
geom_density(alpha=0.5) + theme_bw()
a <- ggplot(var_data, aes(x=violator, y=ccr_exon_frac, fill=violator)) +
geom_violin(alpha=0.5) + theme_bw() +
geom_boxplot(width=0.4, col='black', lwd=0.5, fill='white',
outlier.shape=NA) +
guides(fill=F) + ylab('Exon constraint (%)')
print(a)
wilcox.test(violators$ccr_exon_frac, non_violators$ccr_exon_frac)
##
## Wilcoxon rank sum test with continuity correction
##
## data: violators$ccr_exon_frac and non_violators$ccr_exon_frac
## W = 3752066, p-value = 4.368e-08
## alternative hypothesis: true location shift is not equal to 0
We can see that, reassuringly, there is a more pronounced difference between variants inside “violator” and “non-violator” genes in the exon constraint expressed as a % of the maximum value within a gene. Moreover, there seems to be a difference in the proportion of variants located in the most constrained exon. Let's check it explicitly:
top_cons = as.data.frame(rbind(binconf(sum(na.omit(violators)$ccr_exon_frac == 1),
nrow(na.omit(violators))),
binconf(sum(na.omit(non_violators)$ccr_exon_frac == 1),
nrow(na.omit(non_violators)))))
top_cons$violator = factor(c('yes', 'no'))
b_1 <- ggplot(top_cons, aes(x=violator, y=PointEst,
fill=violator)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(width=0.4, lwd=0.5, aes(ymin=Lower, ymax=Upper)) +
theme_bw() + guides(fill=F) +
ylab('pLoF in most\nconserved exon, %')
print(b_1)
We see that, indeed, there is a difference; at the same time, its significance is limited by the low number of variants in the samples.
Before looking at the individual isoform-related features of pLoF variants, let’s ask whether there is an enrichment of genes with multiple isoforms among “violator” genes. Let’s compare the per-gene number of exons and transcriptional isoforms for “violator” and “non-violator” genes. Here we use the merged_counts.tsv file with gene-level exon and isoform stats. This file was constructed from the GENCODE v19 annotation using a set of shell commands.
gene_data = read.table('./comprehensive_gene_annotation.tsv', sep='\t',
header=T)
iso_data = read.table('merged_counts.tsv', sep='\t', header=F)
colnames(iso_data) = c('isoforms', 'gene', 'exons')
head(iso_data)
## isoforms gene exons
## 1 2 A1BG 9
## 2 8 A1CF 15
## 3 3 A2M 36
## 4 5 A2ML1 38
## 5 2 A3GALT2 5
## 6 3 A4GALT 3
ggplot(iso_data, aes(x=exons, y=isoforms)) +
geom_hex(aes(fill=log10(..count..))) +
scale_fill_gradientn(colours=mypal) +
theme_bw() +
scale_x_continuous(limits=c(0,50))
cor(iso_data$isoforms, iso_data$exons, use='complete.obs')
## [1] 0.4942551
cor.test(iso_data$isoforms, iso_data$exons, use='complete.obs')
##
## Pearson's product-moment correlation
##
## data: iso_data$isoforms and iso_data$exons
## t = 80.951, df = 20272, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4837813 0.5045874
## sample estimates:
## cor
## 0.4942551
rownames(iso_data) = iso_data$gene
gene_data$isoforms = sapply(as.character(gene_data$gene),
function(x) iso_data[x, 'isoforms'])
gene_data$exons = sapply(as.character(gene_data$gene),
function(x) iso_data[x, 'exons'])
iso <- ggplot(gene_data[gene_data$s_main > 0.02, ],
aes(x=violator, y=isoforms, fill=violator)) +
geom_violin(alpha=0.5) +
geom_boxplot(outlier.shape=NA, width=0.3, fill='white') + theme_bw() +
scale_y_continuous(limits=c(0, 30)) + guides(fill=F)
aggregate(isoforms~violator, gene_data[gene_data$s_main > 0.02, ], mean)
## violator isoforms
## 1 no 4.949112
## 2 yes 8.595745
exo <- ggplot(gene_data[gene_data$s_main > 0.02, ],
aes(x=violator, y=exons, fill=violator)) +
geom_violin(alpha=0.5) +
geom_boxplot(outlier.shape=NA, width=0.3, fill='white') + theme_bw() +
scale_y_continuous(limits=c(0, 75)) + guides(fill=F)
aggregate(exons~violator, gene_data[gene_data$s_main > 0.02, ], mean)
## violator exons
## 1 no 13.36721
## 2 yes 38.63830
plot_grid(iso, exo, nrow=1)
We can clearly see that violator genes have substantially higher numbers of exons and isoforms, suggesting that alternative splicing does indeed play a role in LoF misannotation that leads to PTV model violation.
Now let’s explore the expression levels of pLoF-affected isoforms. First, let’s examine the proportion of pLoF variants affecting the most expressed transcript. For this purpose, we have two variables in the data: is_mean denotes whether a pLoF variant affects the transcript with the highest mean expression across isoforms, and is_max indicates whether it affects the isoform with the highest expression level (the maximum of per-isoform maxima).
means = as.data.frame(rbind(binconf(sum(violators$is_mean), nrow(violators)),
binconf(sum(non_violators$is_mean), nrow(non_violators))))
means$violator = c('yes', 'no')
b_2 <- ggplot(means, aes(x=violator, y=PointEst,
fill=violator)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(width=0.4, lwd=0.5, aes(ymin=Lower, ymax=Upper)) +
theme_bw() + guides(fill=F) +
ylab('pLoF in isoform\nwith high TPM, %')
print(b_2)
maxes = as.data.frame(rbind(binconf(sum(violators$is_max), nrow(violators)),
binconf(sum(non_violators$is_max), nrow(non_violators))))
maxes$violator = c('yes', 'no')
b_3 <- ggplot(maxes, aes(x=violator, y=PointEst,
fill=violator)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(width=0.4, lwd=0.5, aes(ymin=Lower, ymax=Upper)) +
theme_bw() + guides(fill=F) +
ylab('pLoF in isoform\nwith max. TPM, %')
print(b_3)
The difference is very evident: pLoF variants in “violator” genes tend not to affect the isoforms with the highest mean or maximum expression across tissues (despite these transcripts being annotated as canonical). Let’s look at the most important features seen so far:
plot_grid(a, b_1, b_2, b_3, nrow=1)
We can also see that the splice site variants are slightly enriched in “violator” genes:
binconf(sum(violators$splice), nrow(violators))
## PointEst Lower Upper
## 0.4790576 0.4294194 0.5291128
binconf(sum(non_violators$splice), nrow(non_violators))
## PointEst Lower Upper
## 0.4023218 0.3971293 0.4075362
Now let’s move on to two other variant-level features related to the constraint measures of the particular affected isoforms. Let’s examine the maximum pLI value and the minimum LOEUF value across isoforms affected by pLoF variants:
pli <- ggplot(var_data, aes(x=violator, y=max_pLI, fill=violator)) +
geom_violin(alpha=0.5) + theme_bw() +
geom_boxplot(width=0.4, col='black', lwd=0.5, fill='white',
outlier.shape=NA) +
guides(fill=F) + ylab('Maximum transcript pLI')
print(pli)
loeuf <- ggplot(var_data, aes(x=violator, y=min_LOEUF, fill=violator)) +
geom_violin(alpha=0.5) + theme_bw() +
geom_boxplot(width=0.4, col='black', lwd=0.5, fill='white',
outlier.shape=NA) +
guides(fill=F) + ylab('Minimum transcript LOEUF')
print(loeuf)
Surprisingly enough, we can see that pLoF variants in “violator” genes tend to affect more constrained isoforms. But could that be just a spurious effect of the higher overall constraint of “violator” genes? Let’s look at the proportion of variants affecting only poorly constrained isoforms. To do so, let’s examine the proportion of variants whose minimum LOEUF across affected transcripts is greater than 0.75:
low_cons_tr = as.data.frame(rbind(binconf(sum(na.omit(violators)$min_LOEUF > 0.75),
nrow(na.omit(violators))),
binconf(sum(na.omit(non_violators)$min_LOEUF > 0.75),
nrow(na.omit(non_violators)))))
low_cons_tr$violator = c('yes', 'no')
oe_1 <- ggplot(low_cons_tr, aes(x=violator, y=PointEst, fill=violator)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), width=0.4, lwd=0.5) +
theme_bw() + guides(fill=F) +
ylab('pLoF in poorly\nconserved transcript, %')
print(oe_1)
No: it seems that, somehow, pLoF variants in “violator” genes really do affect a slightly higher proportion of highly constrained isoforms. This is strange, but let’s move on to some other measures.
We can hypothesize that, while pLoF variants in “violator” genes affect more constrained isoforms, there exists a large fraction of transcriptional isoforms that are not affected by the pLoF variant and are still highly constrained. Let’s examine the proportion of variants having such “rescue” isoforms, as well as the highest constraint among these “rescue” isoforms.
rescuer = as.data.frame(rbind(binconf(sum(na.omit(violators)$rescue == "True"),
nrow(na.omit(violators))),
binconf(sum(na.omit(non_violators)$rescue == "True"),
nrow(na.omit(non_violators)))))
rescuer$violator = factor(c('yes', 'no'))
r_1 <- ggplot(rescuer, aes(x=violator, y=PointEst,
fill=violator)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(width=0.4, lwd=0.5, aes(ymin=Lower, ymax=Upper)) +
theme_bw() + guides(fill=F) +
ylab('% variants with a rescue isoform')
#print(r_1)
r_2 <- ggplot(var_data, aes(x=violator, y=rescue_loeuf_min, fill=violator)) +
geom_violin(alpha=0.5) + theme_bw() +
geom_boxplot(width=0.4, col='black', lwd=0.5, fill='white',
outlier.shape=NA) +
guides(fill=F) + ylab('Min. rescue LOEUF')
#print(r_2)
resc_cons = as.data.frame(rbind(binconf(sum(na.omit(violators)$rescue_loeuf_min < 1),
nrow(na.omit(violators))),
binconf(sum(na.omit(non_violators)$rescue_loeuf_min < 1),
nrow(na.omit(non_violators)))))
resc_cons$violator = factor(c('yes', 'no'))
r_3 <- ggplot(resc_cons, aes(x=violator, y=PointEst,
fill=violator)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(width=0.4, lwd=0.5, aes(ymin=Lower, ymax=Upper)) +
theme_bw() + guides(fill=F) +
ylab('Rescue isoform conserved, %')
#print(r_3)
r_4 <- ggplot(var_data, aes(x=violator, y=rescue_oe_mis, fill=violator)) +
geom_violin(alpha=0.5) + theme_bw() +
geom_boxplot(width=0.4, col='black', lwd=0.5, fill='white',
outlier.shape=NA) +
guides(fill=F) + ylab('Min. rescue missense o/e')
#print(r_4)
plot_grid(r_1, r_2, r_3, r_4, nrow=1)
We can see that, indeed, pLoF variants in “violator” genes have a higher chance of having a rescue isoform, and the constraint levels of such rescue isoforms are also slightly higher compared to variants in “non-violator” genes.
All in all, these data suggest that isoform expression and exon-level constraint play a role in the misannotation of pLoF variants in violator genes as genuine LoF alleles. We will turn to predicting variant “LoF-ness” later; for now, let’s conduct a couple of additional in silico experiments to address the nature of “bad” variants in violator genes.
Here we remove the splice variants from the analysis and examine whether our functional-evidence plots change. Basically, we repeat all the plots on a subset of the original data with splice-site variants removed. [Note: we only plot the most important features.]
var_data_sr = var_data[var_data$splice == F, ]
violators_sr = var_data_sr[var_data_sr$violator == 'yes', ]
non_violators_sr = var_data_sr[var_data_sr$violator == 'no', ]
a <- ggplot(var_data_sr, aes(x=violator, y=ccr_exon_frac, fill=violator)) +
geom_violin(alpha=0.5) + theme_bw() +
geom_boxplot(width=0.4, col='black', lwd=0.5, fill='white',
outlier.shape=NA) +
guides(fill=F) + ylab('Exon constraint (%)')
#print(a)
top_cons = as.data.frame(rbind(binconf(sum(na.omit(violators_sr)$ccr_exon_frac == 1),
nrow(na.omit(violators_sr))),
binconf(sum(na.omit(non_violators_sr)$ccr_exon_frac == 1),
nrow(na.omit(non_violators_sr)))))
top_cons$violator = c('yes', 'no')
b_1 <- ggplot(top_cons, aes(x=violator, y=PointEst, fill=violator)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), width=0.4, lwd=0.5) +
theme_bw() + guides(fill=F) +
ylab('pLoF in most\nconserved exon, %')
means = as.data.frame(rbind(binconf(sum(violators_sr$is_mean), nrow(violators_sr)),
binconf(sum(non_violators_sr$is_mean), nrow(non_violators_sr))))
means$violator = c('yes', 'no')
b_2 <- ggplot(means, aes(x=violator, y=PointEst, fill=violator)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), width=0.4, lwd=0.5) +
theme_bw() + guides(fill=F) +
ylab('pLoF in isoform\nwith high TPM, %')
maxes = as.data.frame(rbind(binconf(sum(violators_sr$is_max), nrow(violators_sr)),
binconf(sum(non_violators_sr$is_max), nrow(non_violators_sr))))
maxes$violator = c('yes', 'no')
b_3 <- ggplot(maxes, aes(x=violator, y=PointEst, fill=violator)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), width=0.4, lwd=0.5) +
theme_bw() + guides(fill=F) +
ylab('pLoF in isoform\nwith max. TPM, %')
plot_grid(a, b_1, b_2, b_3, nrow=1)
Nothing changes substantially, which means that splice-site variants are not the only problem in the dataset.
One other hypothesis is that the observed non-uniformity of the PTV count distribution stems from the presence of a single most common PTV that is misannotated as a genuine LoF variant. To check this hypothesis, we remove the most common variant and repeat the analysis using a new table. Variant removal is done using the remove_top_variant.py script. This chunk uses a function that repeats the code above; for convenience, we do not include the definition of the function in the HTML.
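For reference, a sketch of what generate_plots might look like, reconstructed from the chunks above (the argument names and the exact set of panels are assumptions, since the actual definition is omitted from the HTML):

```r
# Hypothetical reconstruction of generate_plots(): filter variants by an
# s_het threshold, split into violators/non-violators, and rebuild the
# main functional-evidence panels shown earlier.
generate_plots <- function(threshold, filename) {
  vd <- read.table(filename, sep='\t', header=TRUE)
  vd <- vd[vd$s_main > threshold, ]
  viol <- vd[vd$violator == 'yes', ]
  nonv <- vd[vd$violator == 'no', ]

  # Exon constraint as % of the gene maximum
  a <- ggplot(vd, aes(x=violator, y=ccr_exon_frac, fill=violator)) +
    geom_violin(alpha=0.5) +
    geom_boxplot(width=0.4, fill='white', outlier.shape=NA) +
    theme_bw() + guides(fill=F) + ylab('Exon constraint (%)')

  # Proportion of pLoFs in the most constrained exon, with binomial CIs
  top_cons <- as.data.frame(rbind(
    binconf(sum(na.omit(viol)$ccr_exon_frac == 1), nrow(na.omit(viol))),
    binconf(sum(na.omit(nonv)$ccr_exon_frac == 1), nrow(na.omit(nonv)))))
  top_cons$violator <- c('yes', 'no')
  b_1 <- ggplot(top_cons, aes(x=violator, y=PointEst, fill=violator)) +
    geom_bar(col='black', stat='identity') +
    geom_errorbar(aes(ymin=Lower, ymax=Upper), width=0.4, lwd=0.5) +
    theme_bw() + guides(fill=F) + ylab('pLoF in most\nconserved exon, %')

  # ... panels for is_mean and is_max are built the same way ...
  plot_grid(a, b_1, nrow=1)
}
```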
threshold = 0.02
print(generate_plots(threshold, 'top_vars_removed.tsv'))
In fact, the removal of top variants does not solve the issue but rather makes it slightly worse!
Finally, we’ll use the generate_plots function to see how the functional differences between violator and non-violator genes change when we consider genes with lower levels of the overall constraint. To this end, we’ll generate some plots with different \(s_{het}\) thresholds:
x = lapply(c(0.001, 0.006, 0.02), function(x) generate_plots(x, 'full_data_constr_pext.tsv'))
plot_grid(x[[1]], x[[2]], x[[3]], nrow=3)
We can see that there are few differences in our variables for all genes and for genes with moderate constraint (\(s > 0.006\)), but a sharp distinction between variants in highly constrained (\(s > 0.02\)) violator and non-violator genes. This means that the LoF misannotation we identified is truly more pronounced in genes with a high level of evolutionary constraint.
We first explore whether ALoFT scores can explain the observed violation of the PTV model. If they can, we should see an increase in the “LoF tolerated” probability among variants in model-violating genes. First, let’s look at the relationship between \(s_{het}\) and ALoFT scores:
ggplot(var_data, aes(x=s_main, y=aloft_dom)) +
geom_hex(aes(fill=log10(..count..))) + scale_x_log10() +
scale_fill_gradientn(colours=mypal) +
theme_bw()
ggplot(var_data, aes(x=s_main, y=aloft_tol)) +
geom_hex(aes(fill=log10(..count..))) + scale_x_log10() +
scale_fill_gradientn(colours=mypal) +
theme_bw()
We see that there is some relationship, suggesting that ALoFT depends more on the general constraint of the gene than on a particular variant’s properties. Let’s move on to examining the scores for a dominant LoF, recessive LoF, and tolerated LoF assigned by ALoFT to our variants in the \(s_{het} > 0.02\) bin:
#ALoFT scores
ad <- ggplot(var_data, aes(x=violator, y=aloft_dom, fill=violator)) +
geom_boxplot(outlier.shape=NA) +
theme_bw() + guides(fill=F) +
scale_y_continuous(limits = c(0, 1))
# Only recessive shows difference (strangely), dominant - in the opposite direction
ar <- ggplot(var_data, aes(x=violator, y=aloft_rec, fill=violator)) +
geom_boxplot(outlier.shape=NA) +
theme_bw() + guides(fill=F) +
scale_y_continuous(limits = c(0, 1))
at <- ggplot(var_data, aes(x=violator, y=aloft_tol, fill=violator)) +
geom_boxplot(outlier.shape=NA) +
theme_bw() + guides(fill=F) +
scale_y_continuous(limits = c(0, 1))
plot_grid(ad, ar, at, nrow=1)
table(violators$aloft_tol > 0.25)
##
## FALSE TRUE
## 200 14
table(non_violators$aloft_tol > 0.25)
##
## FALSE TRUE
## 20099 1373
Interestingly, we see that ALoFT predicts pLoFs in “violator” genes to be more likely dominant LoF variants (and less likely to be tolerated), which is just the opposite of what we have seen in the previous analyses. Now let’s move on to exploring the pext score proposed by gnomAD.
pext <- ggplot(var_data, aes(x=violator, y=pext_avg, fill=violator)) +
geom_violin(alpha=0.5) +
geom_boxplot(outlier.shape=NA, fill='white', width=0.4, lwd=0.5) +
theme_bw() + guides(fill=F)
print(pext)
var_data$low_pext = var_data$pext_avg < 0.1
aggregate(low_pext ~ violator, var_data, mean)
## violator low_pext
## 1 no 0.04017949
## 2 yes 0.21465969
aggregate(low_pext ~ violator, var_data, length)
## violator low_pext
## 1 no 34097
## 2 yes 382
aggregate(low_pext ~ violator, var_data, sum)
## violator low_pext
## 1 no 1370
## 2 yes 82
wilcox.test(pext_avg~violator, var_data)
##
## Wilcoxon rank sum test with continuity correction
##
## data: pext_avg by violator
## W = 9989590, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
unique(na.omit(var_data)[(na.omit(var_data)$low_pext == TRUE & na.omit(var_data)$violator == 'yes'), ]$gene)
## [1] SPEG IFT80 DST
## 18875 Levels: A1BG A1BG-AS1 A1BG,CTD-2619J13.8 A1CF A2M A2ML1 A3GALT2 ... ZZZ3
We can see that pext scores do differ substantially between violator and non-violator genes, but the proposed 0.1 average pext threshold does not solve the issue: only ~21% of violator variants are filtered out, and those variants belong to just 3 genes.
It’s also important to mention that the pext scores themselves are slightly correlated with the number of isoforms (and exons) of a given gene. Hence, we also want to explicitly check whether the difference in pext scores is solely attributable to these differences:
var_data$exons = as.numeric(sapply(as.character(var_data$gene),
function(x) iso_data[x, 'exons']))
ggplot(var_data, aes(x=exons, y=pext_avg, col=violator)) + geom_point() +
facet_wrap(~violator, nrow=2)
cor.test(as.numeric(var_data$exons),
as.numeric(var_data$pext_avg),
use='complete.obs')
##
## Pearson's product-moment correlation
##
## data: as.numeric(var_data$exons) and as.numeric(var_data$pext_avg)
## t = -52.999, df = 34471, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2842232 -0.2647012
## sample estimates:
## cor
## -0.2744905
pext_vs_exon <- lm(pext_avg~exons+violator, var_data)
summary(pext_vs_exon)
##
## Call:
## lm(formula = pext_avg ~ exons + violator, data = var_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76761 -0.21413 0.09263 0.23685 1.26493
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.738e-01 2.213e-03 349.73 <2e-16 ***
## exons -3.474e-03 6.929e-05 -50.13 <2e-16 ***
## violatoryes -1.659e-01 1.481e-02 -11.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.283 on 34470 degrees of freedom
## (21 observations deleted due to missingness)
## Multiple R-squared: 0.0787, Adjusted R-squared: 0.07864
## F-statistic: 1472 on 2 and 34470 DF, p-value: < 2.2e-16
We can see that pext scores are somewhat correlated with the exon count of a gene, but the linear model shows that the difference between violator and non-violator genes is not explained by the exon count itself: the violator term remains highly significant.
Let’s try constructing a predictive model to discriminate variants in violator genes from those in non-violator genes, starting with a GLM:
var_data$violator = as.factor(var_data$violator)
var_data$rescue_conserved = var_data$rescue_loeuf_min < var_data$min_LOEUF
train_ind = sample(1:nrow(var_data), 0.7*nrow(var_data))
trainset = var_data[train_ind, ]
include = as.character(trainset$violator) == 'yes' | runif(nrow(trainset)) < 0.016
trainset_balanced = trainset[include, ]
testset = var_data[-train_ind, ]
my_fit <- glm(violator~ccr_exon_frac + is_mean +
is_max + rescue_conserved + pext_avg,
trainset, family="binomial", na.action = "na.exclude")
summary(my_fit)
##
## Call:
## glm(formula = violator ~ ccr_exon_frac + is_mean + is_max + rescue_conserved +
## pext_avg, family = "binomial", data = trainset, na.action = "na.exclude")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3948 -0.1536 -0.0998 -0.0819 3.4913
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4238 0.2523 -9.607 < 2e-16 ***
## ccr_exon_frac -1.2063 0.4003 -3.013 0.00259 **
## is_mean 0.3368 0.2354 1.431 0.15239
## is_max -0.6997 0.2212 -3.163 0.00156 **
## rescue_conservedTRUE -0.5972 0.2469 -2.419 0.01556 *
## pext_avg -2.2349 0.3350 -6.671 2.54e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2133.2 on 20222 degrees of freedom
## Residual deviance: 1987.9 on 20217 degrees of freedom
## (3922 observations deleted due to missingness)
## AIC: 1999.9
##
## Number of Fisher Scoring iterations: 8
testset$prediction = predict(my_fit, testset)
#table(testset$prediction)
pROC_obj <- roc(testset$violator, testset$prediction,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
sens.ci <- ci.se(pROC_obj)
plot(sens.ci, type="shape", col="lightblue")
violatoreg_res = data.frame(prob=pROC_obj$thresholds,
sens=pROC_obj$sensitivities,
spec=pROC_obj$specificities)
#plot(sens.ci, type="bars")
#dev.off()
The GLM produces a reasonable result (the model has some discriminative power). Now let’s replace the simple GLM with a more complex approach, a random forest.
my_forest = randomForest(violator~ccr_exon_frac + is_mean +
is_max + rescue_conserved + pext_avg,
trainset, na.action = NULL)
prediction_forest = predict(my_forest, testset, type="vote")
testset$prediction_forest = prediction_forest[, 2]
pROC_obj <- roc(testset$violator, testset$prediction_forest,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
sens.ci <- ci.se(pROC_obj, specificities = pROC_obj$specificities,
sensitivities = pROC_obj$sensitivities)
plot(sens.ci, type="shape", col="lightblue")
#plot(sens.ci, type="bars")
#dev.off()
violatoRF_res = data.frame(prob=pROC_obj$thresholds,
sens=pROC_obj$sensitivities,
spec=pROC_obj$specificities)
The random forest performs better than the GLM; however, the prediction threshold needs to be chosen carefully to maximize the efficiency of the predictive model. Let’s examine the behavior of the model on all gnomAD PTVs (irrespective of the canonical transcript).
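One conventional way to choose such a threshold (an assumption; the notebook does not state the procedure it used) is to maximize Youden’s J over the thresholds stored in the violatoRF_res table built above:

```r
# Pick the prediction-probability threshold maximizing Youden's J = sens + spec - 1.
# violatoRF_res was assembled above from the pROC object.
j <- violatoRF_res$sens + violatoRF_res$spec - 1
best <- violatoRF_res[which.max(j), ]
print(best)  # threshold ('prob') with its sensitivity and specificity
```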
Let’s see what the percentage of positive predictions will be in all genes (irrespective of the \(s_{het}\) bin):
all_vars = read.table('full_data_constr_pext.tsv', sep='\t', header=T)
all_vars$OLS = sapply(as.character(all_vars$gene),
function(x) ifelse(x %in% as.character(cpl_head$gene),
as.numeric(cpl_head[x, 'OLS']), NA))
all_vars = all_vars[!is.na(all_vars$OLS), ]
all_vars$OLS = as.numeric(all_vars$OLS)
all_vars$s_main = sapply(as.character(all_vars$gene),
function(x) ifelse(x %in% as.character(cpl_head$gene),
as.numeric(cpl_head[x, 's_main']), NA))
all_vars$AC_main = all_vars$AC_afr + all_vars$AC_amr +
all_vars$AC_eas + all_vars$AC_sas + all_vars$AC_nfe
all_vars$violator = ifelse(all_vars$OLS > 6, 'yes', 'no')
all_vars$splice = grepl('splice', all_vars$conseq)
all_vars$rescue_conserved = all_vars$rescue_loeuf_min < all_vars$min_LOEUF
prediction_forest_all = predict(my_forest, all_vars, type="vote")
all_vars$prediction_prob = prediction_forest_all[, 2]
all_vars$pred_class = all_vars$prediction_prob > 0.001
all_vars$violator_class = all_vars$violator == 'yes'
av_pred = aggregate(pred_class~violator, all_vars,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))
prediction_forest_cons = predict(my_forest, var_data, type="vote")
var_data$prediction_prob = prediction_forest_cons[, 2]
var_data$pred_class = var_data$prediction_prob > 0.001
cv_pred = aggregate(pred_class~violator, var_data,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))
pred_all = as.data.frame(as.matrix(rbind(av_pred, cv_pred)))
colnames(pred_all) = c('violator', 'PointEst', 'Lower', 'Upper')
pred_all$PointEst = as.numeric(as.character(pred_all$PointEst))
pred_all$Lower = as.numeric(as.character(pred_all$Lower))
pred_all$Upper = as.numeric(as.character(pred_all$Upper))
pred_all$gene_group = rep(c('All', 'Constrained'), each=2)
ggplot(pred_all, aes(x=violator, y=PointEst, fill=violator)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5, width=0.4) +
theme_bw() + guides(fill=F) + xlab('Model violation') +
ylab('% predicted positive') + facet_wrap(~gene_group, nrow=1)
Reassuringly, we see that even in the complete dataset without \(s_{het}\)-based filters, the model predicts significantly more pLoF variants in model-violating genes as non-LoF alleles. In numbers:
table(all_vars[all_vars$s_main < 0.02, ]$pred_class)
##
## FALSE TRUE
## 72361 13596
a = as.data.frame(as.matrix(aggregate(pred_class~violator,
all_vars[all_vars$s_main < 0.02, ],
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
print(a)
## violator pred_class.1 pred_class.2 pred_class.3
## 1 no 0.1580236 0.1552892 0.1607970
## 2 yes 0.1587113 0.1535253 0.1640386
all_vars$s_bin = ifelse(all_vars$s_main > 0.02, 'high',
ifelse(all_vars$s_main > 0.006, 'medium', 'low'))
rf_bin_stats = as.data.frame(as.matrix(aggregate(pred_class~s_bin+violator,
all_vars,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
print(rf_bin_stats)
## s_bin violator pred_class.1 pred_class.2 pred_class.3
## 1 high no 0.10276874 0.09930083 0.10634345
## 2 low no 0.14244978 0.13708369 0.14798990
## 3 medium no 0.16274847 0.15959365 0.16595333
## 4 high yes 0.95035461 0.91840438 0.97020009
## 5 low yes 0.15505164 0.14911868 0.16117593
## 6 medium yes 0.16945917 0.15902094 0.18043557
We see several problems with the RF predictions:
* First, the proportion of variants predicted as low-confidence LoF is actually lower in violator than in non-violator genes with \(s_{het} < 0.02\). Given the functional evidence and examples we have seen, there still is an increased burden of lcLoF variants in genes with low constraint; hence, the RF does not recognize some of the important properties here.
* Second, the proportion of variants predicted as lcLoF does not depend properly on the \(s_{het}\) bin. We would expect that the higher the evolutionary constraint, the higher the proportion of lcLoFs in gnomAD (simply because hcLoFs in highly constrained genes are disease-causing or lethal).
Given these issues, it is also important to check whether our predictor simply dissects genes with low and high number of exons and isoforms:
all_vars$exons = as.numeric(sapply(as.character(all_vars$gene),
function(x) iso_data[x, 'exons']))
a <- ggplot(na.omit(var_data), aes(x=pred_class, y=exons)) + geom_boxplot()
b <- ggplot(na.omit(all_vars), aes(x=pred_class, y=exons)) + geom_boxplot()
plot_grid(a, b, nrow=1)
There is some difference in the number of exons for genes harboring variants predicted as low-confidence LoF, but it is not dramatic. Still, it might explain the strange RF validation results. Hence, let’s switch back to the GLM framework.
Now let’s look at the performance of the GLM (in its best balanced form, 70% sensitivity and 79% specificity):
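The numeric cutoff used below (-4.558241) is presumably read off the ROC table computed earlier; a sketch of how one might locate it (an assumption — the notebook does not show this step):

```r
# Assumption: the GLM cutoff was picked from violatoreg_res (built earlier
# from the pROC object) as a threshold near the reported balance of
# ~70% sensitivity and ~79% specificity.
candidates <- subset(violatoreg_res, sens >= 0.70 & spec >= 0.79)
head(candidates[order(-candidates$sens), ])
```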
all_vars$glm_prediction = predict(my_fit, all_vars)
all_vars$glm_class = all_vars$glm_prediction > -4.558241
av_glm_pred = aggregate(glm_class~violator, all_vars,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))
var_data$glm_prediction = predict(my_fit, var_data)
var_data$glm_class = var_data$glm_prediction > -4.558241
cv_glm_pred = aggregate(glm_class~violator, var_data,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))
pred_all = as.data.frame(as.matrix(rbind(av_glm_pred, cv_glm_pred)))
colnames(pred_all) = c('violator', 'PointEst', 'Lower', 'Upper')
pred_all$PointEst = as.numeric(as.character(pred_all$PointEst))
pred_all$Lower = as.numeric(as.character(pred_all$Lower))
pred_all$Upper = as.numeric(as.character(pred_all$Upper))
pred_all$gene_group = rep(c('All', 'Constrained'), each=2)
ggplot(pred_all, aes(x=violator, y=PointEst, fill=violator)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5, width=0.4) +
theme_bw() + guides(fill=F) + xlab('Model violation') +
ylab('% predicted positive by GLM') + facet_wrap(~gene_group, nrow=1)
table(all_vars[all_vars$s_main < 0.02, ]$glm_class)
##
## FALSE TRUE
## 63179 22778
a = as.data.frame(as.matrix(aggregate(glm_class~violator,
all_vars[all_vars$s_main < 0.02, ],
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
print(a)
## violator glm_class.1 glm_class.2 glm_class.3
## 1 no 0.2589213 0.2556280 0.2622421
## 2 yes 0.2870380 0.2805748 0.2935894
all_vars$s_bin = ifelse(all_vars$s_main > 0.02, 'high',
ifelse(all_vars$s_main > 0.006, 'medium', 'low'))
bin_stats = as.data.frame(as.matrix(aggregate(glm_class~s_bin+violator,
all_vars,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
print(bin_stats)
## s_bin violator glm_class.1 glm_class.2 glm_class.3
## 1 high no 0.2826840 0.2774919 0.2879346
## 2 low no 0.2417464 0.2351299 0.2484886
## 3 medium no 0.2649178 0.2611339 0.2687366
## 4 high yes 0.6063830 0.5482988 0.6616078
## 5 low yes 0.2663393 0.2590421 0.2737660
## 6 medium yes 0.3478261 0.3343602 0.3615397
We can see that there is a clear relationship between constraint and the proportion of low-confidence LoF variants. Let’s see what we get with the pext filter alone:
all_vars$pext_class = all_vars$pext_avg < 0.1
pext_bin_stats = as.data.frame(as.matrix(aggregate(pext_class~s_bin, all_vars,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
print(pext_bin_stats)
## s_bin pext_class.1 pext_class.2 pext_class.3
## 1 high 0.04211259 0.04004311 0.04428409
## 2 low 0.05352740 0.05130104 0.05584469
## 3 medium 0.03970088 0.03825843 0.04119539
There is no such relationship. Quite notably, the important rs139297920 variant in the PAX3 gene is marked as low-confidence LoF by our GLM but is not filtered by the pext score alone:
The rs139297920 variant in PAX3.
Let’s do some more exploration of the GLM prediction scores.
ggplot(all_vars, aes(x=violator, y=glm_prediction, fill=violator)) +
geom_violin(alpha=0.5) +
geom_boxplot(fill='white', outlier.shape=NA, lwd=0.5, width=0.4) +
theme_bw() + facet_wrap(~s_bin)
Given this evidence, the GLM seems more accurate; however, the same GLM with pext alone as a predictor performs quite similarly:
pext_fit = glm(violator~pext_avg, trainset, family='binomial',
na.action = "na.exclude")
summary(pext_fit)
##
## Call:
## glm(formula = violator ~ pext_avg, family = "binomial", data = trainset,
## na.action = "na.exclude")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3345 -0.1649 -0.1006 -0.0791 3.4311
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8553 0.1052 -27.14 <2e-16 ***
## pext_avg -3.0324 0.2086 -14.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2873.2 on 24136 degrees of freedom
## Residual deviance: 2642.2 on 24135 degrees of freedom
## (8 observations deleted due to missingness)
## AIC: 2646.2
##
## Number of Fisher Scoring iterations: 8
testset$prediction_pext = predict(pext_fit, testset)
#table(testset$prediction)
pROC_obj <- roc(testset$violator, testset$prediction_pext,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
sens.ci <- ci.se(pROC_obj)
plot(sens.ci, type="shape", col="lightblue")
Let’s try and construct an RF model, training it on a dataset with artificially balanced classes:
my_forest_b = randomForest(violator~ccr_exon_frac + is_mean +
is_max + rescue_conserved + pext_avg,
trainset_balanced, na.action = NULL)
prediction_forest_b = predict(my_forest_b, testset, type="vote")
testset$prediction_forest_b = prediction_forest_b[, 2]
pROC_obj <- roc(testset$violator, testset$prediction_forest_b,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
sens.ci <- ci.se(pROC_obj, specificities = pROC_obj$specificities,
sensitivities = pROC_obj$sensitivities)
plot(sens.ci, type="shape", col="lightblue")
#plot(sens.ci, type="bars")
#dev.off()
violatoRF_b_res = data.frame(prob=pROC_obj$thresholds,
sens=pROC_obj$sensitivities,
spec=pROC_obj$specificities)
save(my_forest_b, file='classifier.RData')
The ROC shows good performance: slightly worse than the unbalanced RF, but better than the GLM. Now let’s investigate the percentage of positive predictions and the relationship with gene-level constraint. Here, we evaluate performance using a loose cutoff that allows higher sensitivity (93%) at lower specificity (61%).
prediction_forest_all = predict(my_forest_b, all_vars, type="vote")
all_vars$prediction_prob_b = prediction_forest_all[, 2]
all_vars$pred_class_b = all_vars$prediction_prob_b > 0.25
all_vars$violator_class = all_vars$violator == 'yes'
av_pred = aggregate(pred_class_b~violator, all_vars,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))
prediction_forest_cons = predict(my_forest_b, var_data, type="vote")
var_data$prediction_prob_b = prediction_forest_cons[, 2]
var_data$pred_class_b = var_data$prediction_prob_b > 0.25
cv_pred = aggregate(pred_class_b~violator, var_data,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))
pred_all = as.data.frame(as.matrix(rbind(av_pred, cv_pred)))
colnames(pred_all) = c('violator', 'PointEst', 'Lower', 'Upper')
pred_all$PointEst = as.numeric(as.character(pred_all$PointEst))
pred_all$Lower = as.numeric(as.character(pred_all$Lower))
pred_all$Upper = as.numeric(as.character(pred_all$Upper))
pred_all$gene_group = rep(c('All', 'Constrained'), each=2)
ggplot(pred_all, aes(x=violator, y=PointEst, fill=violator)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5, width=0.4) +
theme_bw() + guides(fill=F) + xlab('Model violation') +
ylab('% predicted positive') + facet_wrap(~gene_group, nrow=1)
a = as.data.frame(as.matrix(aggregate(pred_class_b~violator,
all_vars[all_vars$s_main < 0.02, ],
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
print(a)
## violator pred_class_b.1 pred_class_b.2 pred_class_b.3
## 1 no 0.2735069 0.2701545 0.2768852
## 2 yes 0.2755630 0.2691823 0.2820365
all_vars$s_bin = ifelse(all_vars$s_main > 0.02, 'high',
ifelse(all_vars$s_main > 0.006, 'medium', 'low'))
bin_stats = as.data.frame(as.matrix(aggregate(pred_class_b~s_bin+violator,
all_vars,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
print(bin_stats)
## s_bin violator pred_class_b.1 pred_class_b.2 pred_class_b.3
## 1 high no 0.2897896 0.2845576 0.2950782
## 2 low no 0.2529624 0.2462408 0.2598043
## 3 medium no 0.2804683 0.2766149 0.2843542
## 4 high yes 0.9290780 0.8930001 0.9536230
## 5 low yes 0.2589731 0.2517441 0.2663357
## 6 medium yes 0.3242842 0.3110705 0.3377840
The behavior of the model matches expectations. Importantly, it filters out not only the aforementioned variant in PAX3, but also some other interesting variants (in PAX6 and TERF1).
Let’s systematically examine the positive prediction rate per \(s_{het}\) bin at different prediction probability cutoffs.
all_vars$RF_class_25 = all_vars$prediction_prob_b > 0.25
all_vars$RF_class_50 = all_vars$prediction_prob_b > 0.5
all_vars$RF_class_75 = all_vars$prediction_prob_b > 0.75
snippet = all_vars[, c('violator', 's_bin', 'RF_class_25',
'RF_class_50', 'RF_class_75', 'pext_class')]
snippet_molten = na.omit(melt(snippet, id.vars=c('violator', 's_bin')))
cutoff_stats = as.data.frame(as.matrix(aggregate(value~s_bin+violator+variable,
snippet_molten,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
colnames(cutoff_stats) = c('s_bin', 'violator', 'classifier', 'PointEst',
'Lower', 'Upper')
cutoff_stats$PointEst = as.numeric(as.character(cutoff_stats$PointEst))
cutoff_stats$Lower = as.numeric(as.character(cutoff_stats$Lower))
cutoff_stats$Upper = as.numeric(as.character(cutoff_stats$Upper))
cutoff_stats$s_bin = factor(cutoff_stats$s_bin,
levels=c('low', 'medium', 'high'))
print(cutoff_stats)
## s_bin violator classifier PointEst Lower Upper
## 1 high no RF_class_25 0.28978963 0.28455757 0.29507822
## 2 low no RF_class_25 0.25296242 0.24624075 0.25980433
## 3 medium no RF_class_25 0.28046825 0.27661485 0.28435423
## 4 high yes RF_class_25 0.92907801 0.89300014 0.95362302
## 5 low yes RF_class_25 0.25897306 0.25174412 0.26633570
## 6 medium yes RF_class_25 0.32428420 0.31107051 0.33778398
## 7 high no RF_class_50 0.12090028 0.11717079 0.12473169
## 8 low no RF_class_50 0.07800520 0.07392300 0.08229279
## 9 medium no RF_class_50 0.10755887 0.10491902 0.11025696
## 10 high yes RF_class_50 0.67730496 0.62067297 0.72917131
## 11 low yes RF_class_50 0.09236658 0.08765636 0.09730291
## 12 medium yes RF_class_50 0.13722163 0.12769524 0.14733868
## 13 high no RF_class_75 0.04182856 0.03956805 0.04421226
## 14 low no RF_class_75 0.02566377 0.02330966 0.02824876
## 15 medium no RF_class_75 0.03917554 0.03753816 0.04088131
## 16 high yes RF_class_75 0.39716312 0.34180393 0.45528638
## 17 low yes RF_class_75 0.03444789 0.03153703 0.03761700
## 18 medium yes RF_class_75 0.05683987 0.05058462 0.06381665
## 19 high no pext_class 0.04017949 0.03814633 0.04231624
## 20 low no pext_class 0.04868144 0.04578085 0.05175584
## 21 medium no pext_class 0.03949843 0.03799601 0.04105773
## 22 high yes pext_class 0.21465969 0.17643388 0.25856722
## 23 low yes pext_class 0.05896585 0.05559707 0.06252523
## 24 medium yes pext_class 0.04189944 0.03701045 0.04740247
ggplot(cutoff_stats, aes(x=s_bin, y=PointEst, fill=s_bin)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5, width=0.4) +
theme_bw() + guides(fill=F) + xlab('Conservation level') +
ylab('% predicted positive') +
facet_grid(vars(violator), vars(classifier))
# No violator status
cutoff_stats_2 = as.data.frame(as.matrix(aggregate(value~s_bin+variable,
snippet_molten,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
colnames(cutoff_stats_2) = c('s_bin', 'classifier', 'PointEst',
'Lower', 'Upper')
cutoff_stats_2$PointEst = as.numeric(as.character(cutoff_stats_2$PointEst))
cutoff_stats_2$Lower = as.numeric(as.character(cutoff_stats_2$Lower))
cutoff_stats_2$Upper = as.numeric(as.character(cutoff_stats_2$Upper))
cutoff_stats_2$s_bin = factor(cutoff_stats_2$s_bin,
levels=c('low', 'medium', 'high'))
cutoff_stats_2$classifier = factor(cutoff_stats_2$classifier,
levels=c('RF_class_25', 'RF_class_50',
'RF_class_75', 'pext_class'))
print(cutoff_stats_2)
## s_bin classifier PointEst Lower Upper
## 1 high RF_class_25 0.29603827 0.29079806 0.30133278
## 2 low RF_class_25 0.25577157 0.25083551 0.26077095
## 3 medium RF_class_25 0.28412592 0.28042135 0.28785985
## 4 high RF_class_50 0.12633878 0.12255485 0.13022219
## 5 low RF_class_50 0.08471716 0.08160001 0.08794199
## 6 medium RF_class_50 0.11003506 0.10748078 0.11264237
## 7 high RF_class_75 0.04530172 0.04296195 0.04776257
## 8 low RF_class_75 0.02976914 0.02789410 0.03176610
## 9 medium RF_class_75 0.04065012 0.03905252 0.04231019
## 10 high pext_class 0.04211259 0.04004311 0.04428409
## 11 low pext_class 0.05352740 0.05130104 0.05584469
## 12 medium pext_class 0.03970088 0.03825843 0.04119539
ggplot(cutoff_stats_2, aes(x=s_bin, y=PointEst, fill=s_bin)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5, width=0.4) +
theme_bw() + guides(fill=F) + xlab('Conservation level') +
ylab('% predicted positive') +
facet_wrap(~classifier, nrow=1) + scale_y_continuous(limits=c(0, 1))
ggplot(cutoff_stats_2, aes(x=classifier, y=PointEst, fill=classifier)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5, width=0.4) +
theme_bw() + guides(fill=F) + xlab('Conservation level') +
ylab('% predicted positive') +
facet_wrap(~s_bin, nrow=1) + scale_y_continuous(limits=c(0, 1)) +
theme(axis.text.x=element_text(angle=45, hjust=1))
Exploring the effect of filtering out predicted low-confidence LoF variants on gene-level PTV allele counts:
vars_non_true = all_vars[all_vars$RF_class_25 %in% c(FALSE, NA), ]
gene_level = aggregate(AC_main ~ gene, vars_non_true, sum)
gene_level$source_ac = sapply(as.character(gene_level$gene), function(x) cpl_head[x, 'AC_main'])
ggplot(gene_level, aes(source_ac, AC_main)) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() + scale_fill_gradientn(colours = mypal)
What happens to the imbalance of PTV counts?
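The imbalance test used below puts per-population allele counts (AC) in one row of a contingency table and allele numbers (AN) in the other, then applies a chi-squared test per gene. A minimal sketch on made-up counts for five populations:

```r
# Made-up per-population allele counts and allele numbers
# (order: nfe, afr, amr, eas, sas)
ac <- c(120, 15, 30, 10, 25)
an <- c(110000, 16000, 34000, 18000, 30000)
# 2 x 5 contingency table: row 1 = AC, row 2 = AN, as in the per-gene tests below
p <- chisq.test(matrix(c(ac, an), byrow = TRUE, nrow = 2))$p.value
p
```

A small p-value flags a gene whose PTV burden is unevenly distributed across populations relative to sampling depth.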
cov_data = read.table('./covered_genes.tsv', header=T, sep='\t')
vars_non_true = vars_non_true[as.character(vars_non_true$gene) %in% as.character(cov_data$name), ]
vars_non_true$gene = droplevels(vars_non_true$gene)
ac_cols = grepl('AC_', colnames(vars_non_true))
an_cols = grepl('AN_', colnames(vars_non_true))
gene_data = data.frame(gene = sort(unique(vars_non_true$gene)))
for (i in colnames(vars_non_true)[ac_cols]){
gene_data[, i] = as.numeric(by(vars_non_true[, i], vars_non_true$gene, sum))
}
gene_data$AC_main = gene_data$AC_nfe + gene_data$AC_afr + gene_data$AC_amr +
gene_data$AC_eas + gene_data$AC_sas
for (i in colnames(vars_non_true)[an_cols]){
gene_data[, i] = as.numeric(by(vars_non_true[, i], vars_non_true$gene, mean))
}
gene_data$AN_main = gene_data$AN_nfe + gene_data$AN_afr + gene_data$AN_amr +
gene_data$AN_eas + gene_data$AN_sas
# Add covered genes with no pLoF
dummy_row = c(rep(0, 8), rep(colMeans(gene_data[, 10:17])))
zero_genes = unique(as.character(cov_data$name)[!(as.character(cov_data$name)
%in% as.character(gene_data$gene))])
remainder = as.data.frame(t(sapply(zero_genes, function(x) c(x, dummy_row))))
rownames(remainder) = c()
colnames(remainder) = colnames(gene_data)
flt_gene_data = as.data.frame(rbind(gene_data, remainder))
num_cols = colnames(flt_gene_data)[grepl('A[CN]_', colnames(flt_gene_data))]
for (col in num_cols){
flt_gene_data[, col] = as.numeric(as.character(flt_gene_data[, col]))
}
flt_gene_data$gene = as.factor(as.character(flt_gene_data$gene))
flt_gene_data =
flt_gene_data[flt_gene_data$AC_main/flt_gene_data$AN_main <= 0.001, ]
# Cassa comparison
cassa = read.table('cassa_table.csv', header=T, sep='\t')
rownames(cassa) = cassa$gene_symbol
flt_gene_data$U = sapply(flt_gene_data$gene,
function(x) cassa[as.character(x), 'U'])
flt_gene_data = flt_gene_data[!(is.na(flt_gene_data$U)), ]
flt_gene_data[is.na(flt_gene_data)] = 1
perc = ecdf(flt_gene_data$U)
flt_gene_data$tercile = floor(perc(flt_gene_data$U) * 3)
flt_gene_data$tercile = ifelse(flt_gene_data$tercile < 3,
flt_gene_data$tercile + 1,
flt_gene_data$tercile)
table(flt_gene_data$tercile)
##
## 1 2 3
## 5323 5329 5339
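The tercile assignment above maps each gene’s U through its empirical CDF, scales to the range 0–3, and then shifts so that only the maximum value stays in the top bin. A self-contained sketch on toy values:

```r
# Toy values standing in for the per-gene mutation rates U
u <- c(0.12, 0.05, 0.40, 0.33, 0.08, 0.91, 0.77, 0.20, 0.64)
perc <- ecdf(u)
terc <- floor(perc(u) * 3)               # 0..3; only the maximum reaches 3
terc <- ifelse(terc < 3, terc + 1, terc) # shift to 1..3, clamping the top value
table(terc)
```

Note the bins are only approximately equal-sized because of the floor at the bin boundaries, which matches the slightly uneven counts printed above.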
flt_gene_data$shet_cassa = sapply(flt_gene_data$gene,
function(x) cassa[as.character(x), 's_het'])
table(is.na(flt_gene_data$shet_cassa))
##
## FALSE
## 15991
rownames(flt_gene_data) = flt_gene_data$gene
inheritance = read.table('genes_inheritance.txt', header=F, sep='\t',
stringsAsFactors = F, row.names=1)
flt_gene_data$disease = sapply(rownames(flt_gene_data),
function(x) ifelse(x%in% rownames(inheritance),
inheritance[x, 'V2'], 'none'))
flt_gene_data$dbin = sapply(flt_gene_data$disease,
function(x) ifelse(x == "none", 0, 1))
flt_gene_data$whole = apply(flt_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:8]),
as.numeric(elt[10:16])),
byrow=T, nrow=2))$p.value)
flt_gene_data$no_fin_asj = apply(flt_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6]),
as.numeric(elt[10:14])),
byrow=T, nrow=2))$p.value)
for (i in 2:6) {
flt_gene_data[, ncol(flt_gene_data) + 1] = apply(flt_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6][c(2:6) != i]),
as.numeric(elt[10:14][c(10:14) != 8+i])),
byrow=T, nrow=2))$p.value)
}
colnames(flt_gene_data) = c(colnames(flt_gene_data)[1:24], 'no_afr',
'no_sas', 'no_amr', 'no_eas', 'no_nfe')
flt_gene_data$s_main = sapply(as.character(flt_gene_data$gene),
function(x) as.numeric(cpl_head[as.character(cpl_head$gene) == x,
's_main'])[1])
#flt_gene_data = flt_gene_data[flt_gene_data$s_main >= 0.02, ]
tpl = melt(flt_gene_data[, c(1, 21:29)], id.vars=c('gene', 'disease', 'dbin'))
tpl = na.omit(tpl[tpl$value < 1e-5, ])
sum(tpl$variable == 'whole')
## [1] 2185
sum(tpl$variable == 'whole' & tpl$dbin == 1)
## [1] 409
sum(tpl$variable == 'no_fin_asj')
## [1] 1581
sum(tpl$variable == 'no_fin_asj' & tpl$dbin == 1)
## [1] 261
ggplot(tpl, aes(x=variable, fill=as.factor(dbin))) + geom_bar() +
theme_bw() + coord_flip()
flt_gene_data$no_fin_asj = sapply(flt_gene_data$no_fin_asj,
function(x) ifelse(is.na(x), 1, x))
flt_gene_data$no_fin_asj_prior = sapply(as.character(flt_gene_data$gene),
function(x) as.numeric(cpl_head[as.character(cpl_head$gene) == x, 'no_fin_asj'])[1])
ggplot(flt_gene_data, aes(x=-log10(no_fin_asj_prior), y=-log10(no_fin_asj))) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() +
xlab('chi-squared p-value pre-filter') +
ylab('chi-squared p-value post-filter') +
scale_fill_gradientn(colours = mypal)
table(flt_gene_data$no_fin_asj > 1e-5 & flt_gene_data$no_fin_asj_prior < 1e-5)
##
## FALSE TRUE
## 13209 366
binconf(sum(flt_gene_data$no_fin_asj >= 1e-5 &
flt_gene_data$no_fin_asj_prior < 1e-5, na.rm=T),
sum(flt_gene_data$no_fin_asj_prior < 1e-5, na.rm=T))
## PointEst Lower Upper
## 0.1967742 0.1793396 0.2154587
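`Hmisc::binconf` reports a Wilson score interval by default, which can be cross-checked in base R. The sketch below recomputes the interval for the counts above (the denominator, ≈1860, is back-calculated from the reported fraction 366/0.1967742, so treat it as approximate):

```r
# Wilson score interval in base R: x successes out of n trials
wilson <- function(x, n, z = qnorm(0.975)) {
  p <- x / n
  centre <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(PointEst = p, Lower = centre - half, Upper = centre + half)
}
w <- wilson(366, 1860)
w  # close to the binconf output above
```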
Let’s compare the percentage of genes with no remaining imbalance after filtering with the RF model versus random exclusion of the same number of variants:
vars_random = all_vars[sample(1:nrow(all_vars), nrow(vars_non_true)), ]
cov_data = read.table('./covered_genes.tsv', header=T, sep='\t')
vars_random = vars_random[as.character(vars_random$gene) %in% as.character(cov_data$name), ]
vars_random$gene = droplevels(vars_random$gene)
ac_cols = grepl('AC_', colnames(vars_random))
an_cols = grepl('AN_', colnames(vars_random))
gene_data = data.frame(gene = sort(unique(vars_random$gene)))
for (i in colnames(vars_random)[ac_cols]){
gene_data[, i] = as.numeric(by(vars_random[, i], vars_random$gene, sum))
}
gene_data$AC_main = gene_data$AC_nfe + gene_data$AC_afr + gene_data$AC_amr +
gene_data$AC_eas + gene_data$AC_sas
for (i in colnames(vars_random)[an_cols]){
gene_data[, i] = as.numeric(by(vars_random[, i], vars_random$gene, mean))
}
gene_data$AN_main = gene_data$AN_nfe + gene_data$AN_afr + gene_data$AN_amr +
gene_data$AN_eas + gene_data$AN_sas
# Add covered genes with no pLoF
dummy_row = c(rep(0, 8), rep(colMeans(gene_data[, 10:17])))
zero_genes = unique(as.character(cov_data$name)[!(as.character(cov_data$name)
%in% as.character(gene_data$gene))])
remainder = as.data.frame(t(sapply(zero_genes, function(x) c(x, dummy_row))))
rownames(remainder) = c()
colnames(remainder) = colnames(gene_data)
random_gene_data = as.data.frame(rbind(gene_data, remainder))
num_cols = colnames(random_gene_data)[grepl('A[CN]_', colnames(random_gene_data))]
for (col in num_cols){
random_gene_data[, col] = as.numeric(as.character(random_gene_data[, col]))
}
random_gene_data$gene = as.factor(as.character(random_gene_data$gene))
random_gene_data =
random_gene_data[random_gene_data$AC_main/random_gene_data$AN_main <= 0.001, ]
# Cassa comparison
cassa = read.table('cassa_table.csv', header=T, sep='\t')
rownames(cassa) = cassa$gene_symbol
random_gene_data$U = sapply(random_gene_data$gene,
function(x) cassa[as.character(x), 'U'])
random_gene_data = random_gene_data[!(is.na(random_gene_data$U)), ]
random_gene_data[is.na(random_gene_data)] = 1
perc = ecdf(random_gene_data$U)
random_gene_data$tercile = floor(perc(random_gene_data$U) * 3)
random_gene_data$tercile = ifelse(random_gene_data$tercile < 3,
random_gene_data$tercile + 1,
random_gene_data$tercile)
table(random_gene_data$tercile)
##
## 1 2 3
## 5323 5329 5342
random_gene_data$shet_cassa = sapply(random_gene_data$gene,
function(x) cassa[as.character(x), 's_het'])
table(is.na(random_gene_data$shet_cassa))
##
## FALSE
## 15994
rownames(random_gene_data) = random_gene_data$gene
inheritance = read.table('genes_inheritance.txt', header=F, sep='\t',
stringsAsFactors = F, row.names=1)
random_gene_data$disease = sapply(rownames(random_gene_data),
function(x) ifelse(x%in% rownames(inheritance),
inheritance[x, 'V2'], 'none'))
random_gene_data$dbin = sapply(random_gene_data$disease,
function(x) ifelse(x == "none", 0, 1))
random_gene_data$whole = apply(random_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:8]),
as.numeric(elt[10:16])),
byrow=T, nrow=2))$p.value)
random_gene_data$no_fin_asj = apply(random_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6]),
as.numeric(elt[10:14])),
byrow=T, nrow=2))$p.value)
for (i in 2:6) {
random_gene_data[, ncol(random_gene_data) + 1] = apply(random_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6][c(2:6) != i]),
as.numeric(elt[10:14][c(10:14) != 8+i])),
byrow=T, nrow=2))$p.value)
}
colnames(random_gene_data) = c(colnames(random_gene_data)[1:24], 'no_afr',
'no_sas', 'no_amr', 'no_eas', 'no_nfe')
random_gene_data$s_main = sapply(as.character(random_gene_data$gene),
function(x) as.numeric(cpl_head[as.character(cpl_head$gene) == x,
's_main'])[1])
#random_gene_data = random_gene_data[random_gene_data$s_main >= 0.02, ]
tpl = melt(random_gene_data[, c(1, 21:29)], id.vars=c('gene', 'disease', 'dbin'))
tpl = na.omit(tpl[tpl$value < 1e-5, ])
sum(tpl$variable == 'whole')
## [1] 2226
sum(tpl$variable == 'whole' & tpl$dbin == 1)
## [1] 399
sum(tpl$variable == 'no_fin_asj')
## [1] 1623
sum(tpl$variable == 'no_fin_asj' & tpl$dbin == 1)
## [1] 255
ggplot(tpl, aes(x=variable, fill=as.factor(dbin))) + geom_bar() +
theme_bw() + coord_flip()
random_gene_data$no_fin_asj = sapply(random_gene_data$no_fin_asj,
function(x) ifelse(is.na(x), 1, x))
random_gene_data$no_fin_asj_prior = sapply(as.character(random_gene_data$gene),
function(x) as.numeric(cpl_head[as.character(cpl_head$gene) == x, 'no_fin_asj'])[1])
ggplot(random_gene_data, aes(x=-log10(no_fin_asj_prior), y=-log10(no_fin_asj))) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() +
xlab('chi-squared p-value pre-filter') +
ylab('chi-squared p-value post-filter') +
scale_fill_gradientn(colours = mypal)
table(random_gene_data$no_fin_asj > 1e-5 & random_gene_data$no_fin_asj_prior < 1e-5)
##
## FALSE TRUE
## 13176 402
binconf(sum(random_gene_data$no_fin_asj >= 1e-5 &
random_gene_data$no_fin_asj_prior < 1e-5, na.rm=T),
sum(random_gene_data$no_fin_asj_prior < 1e-5, na.rm=T))
## PointEst Lower Upper
## 0.2160129 0.1979111 0.2352847
Significantly fewer genes would become non-significant after random exclusion of variants; hence, our model does capture some information (though not as much as we would probably like). What would the percentage be if we used pext-based filtering?
vars_pext_flt = all_vars[!(all_vars$pext_avg < 0.25), ]
cov_data = read.table('./covered_genes.tsv', header=T, sep='\t')
vars_pext_flt = vars_pext_flt[as.character(vars_pext_flt$gene) %in% as.character(cov_data$name), ]
vars_pext_flt$gene = droplevels(vars_pext_flt$gene)
ac_cols = grepl('AC_', colnames(vars_pext_flt))
an_cols = grepl('AN_', colnames(vars_pext_flt))
gene_data = data.frame(gene = sort(unique(vars_pext_flt$gene)))
for (i in colnames(vars_pext_flt)[ac_cols]){
gene_data[, i] = as.numeric(by(vars_pext_flt[, i], vars_pext_flt$gene, sum))
}
gene_data$AC_main = gene_data$AC_nfe + gene_data$AC_afr + gene_data$AC_amr +
gene_data$AC_eas + gene_data$AC_sas
for (i in colnames(vars_pext_flt)[an_cols]){
gene_data[, i] = as.numeric(by(vars_pext_flt[, i], vars_pext_flt$gene, mean))
}
gene_data$AN_main = gene_data$AN_nfe + gene_data$AN_afr + gene_data$AN_amr +
gene_data$AN_eas + gene_data$AN_sas
# Add covered genes with no pLoF
dummy_row = c(rep(0, 8), rep(colMeans(gene_data[, 10:17])))
zero_genes = unique(as.character(cov_data$name)[!(as.character(cov_data$name)
%in% as.character(gene_data$gene))])
remainder = as.data.frame(t(sapply(zero_genes, function(x) c(x, dummy_row))))
rownames(remainder) = c()
colnames(remainder) = colnames(gene_data)
pext_gene_data = as.data.frame(rbind(gene_data, remainder))
num_cols = colnames(pext_gene_data)[grepl('A[CN]_', colnames(pext_gene_data))]
for (col in num_cols){
pext_gene_data[, col] = as.numeric(as.character(pext_gene_data[, col]))
}
pext_gene_data$gene = as.factor(as.character(pext_gene_data$gene))
pext_gene_data =
pext_gene_data[pext_gene_data$AC_main/pext_gene_data$AN_main <= 0.001, ]
# Cassa comparison
cassa = read.table('cassa_table.csv', header=T, sep='\t')
rownames(cassa) = cassa$gene_symbol
pext_gene_data$U = sapply(pext_gene_data$gene,
function(x) cassa[as.character(x), 'U'])
pext_gene_data = pext_gene_data[!(is.na(pext_gene_data$U)), ]
pext_gene_data[is.na(pext_gene_data)] = 1
perc = ecdf(pext_gene_data$U)
pext_gene_data$tercile = floor(perc(pext_gene_data$U) * 3)
pext_gene_data$tercile = ifelse(pext_gene_data$tercile < 3,
pext_gene_data$tercile + 1,
pext_gene_data$tercile)
table(pext_gene_data$tercile)
##
## 1 2 3
## 5323 5329 5342
pext_gene_data$shet_cassa = sapply(pext_gene_data$gene,
function(x) cassa[as.character(x), 's_het'])
table(is.na(pext_gene_data$shet_cassa))
##
## FALSE
## 15994
rownames(pext_gene_data) = pext_gene_data$gene
inheritance = read.table('genes_inheritance.txt', header=F, sep='\t',
stringsAsFactors = F, row.names=1)
pext_gene_data$disease = sapply(rownames(pext_gene_data),
function(x) ifelse(x%in% rownames(inheritance),
inheritance[x, 'V2'], 'none'))
pext_gene_data$dbin = sapply(pext_gene_data$disease,
function(x) ifelse(x == "none", 0, 1))
pext_gene_data$whole = apply(pext_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:8]),
as.numeric(elt[10:16])),
byrow=T, nrow=2))$p.value)
pext_gene_data$no_fin_asj = apply(pext_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6]),
as.numeric(elt[10:14])),
byrow=T, nrow=2))$p.value)
for (i in 2:6) {
pext_gene_data[, ncol(pext_gene_data) + 1] = apply(pext_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6][c(2:6) != i]),
as.numeric(elt[10:14][c(10:14) != 8+i])),
byrow=T, nrow=2))$p.value)
}
colnames(pext_gene_data) = c(colnames(pext_gene_data)[1:24], 'no_afr',
'no_sas', 'no_amr', 'no_eas', 'no_nfe')
pext_gene_data$s_main = sapply(as.character(pext_gene_data$gene),
function(x) as.numeric(cpl_head[as.character(cpl_head$gene) == x,
's_main'])[1])
#pext_gene_data = pext_gene_data[pext_gene_data$s_main >= 0.02, ]
tpl = melt(pext_gene_data[, c(1, 21:29)], id.vars=c('gene', 'disease', 'dbin'))
tpl = na.omit(tpl[tpl$value < 1e-5, ])
sum(tpl$variable == 'whole')
## [1] 2222
sum(tpl$variable == 'whole' & tpl$dbin == 1)
## [1] 407
sum(tpl$variable == 'no_fin_asj')
## [1] 1604
sum(tpl$variable == 'no_fin_asj' & tpl$dbin == 1)
## [1] 255
ggplot(tpl, aes(x=variable, fill=as.factor(dbin))) + geom_bar() +
theme_bw() + coord_flip()
pext_gene_data$no_fin_asj = sapply(pext_gene_data$no_fin_asj,
function(x) ifelse(is.na(x), 1, x))
pext_gene_data$no_fin_asj_prior = sapply(as.character(pext_gene_data$gene),
function(x) as.numeric(cpl_head[as.character(cpl_head$gene) == x, 'no_fin_asj'])[1])
ggplot(pext_gene_data, aes(x=-log10(no_fin_asj_prior), y=-log10(no_fin_asj))) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() +
xlab('chi-squared p-value pre-filter') +
ylab('chi-squared p-value post-filter') +
scale_fill_gradientn(colours = mypal)
table(pext_gene_data$no_fin_asj > 1e-4 & pext_gene_data$no_fin_asj_prior < 1e-5)
##
## FALSE TRUE
## 13297 281
binconf(sum(pext_gene_data$no_fin_asj >= 1e-5 &
pext_gene_data$no_fin_asj_prior < 1e-5, na.rm=T),
sum(pext_gene_data$no_fin_asj_prior < 1e-5, na.rm=T))
## PointEst Lower Upper
## 0.1558302 0.1400625 0.1730159
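The gene-level pipeline above runs three times verbatim (RF-filtered, randomly thinned, and pext-filtered variant sets), so its aggregation core could be factored into a helper. A sketch on a toy variant table using the notebook’s AC_*/AN_* column convention (only two populations here for brevity):

```r
# Sum allele counts (AC) and average allele numbers (AN) per gene,
# as in the three gene-level blocks above
aggregate_genes <- function(vars) {
  ac_cols <- grep('^AC_', colnames(vars), value = TRUE)
  an_cols <- grep('^AN_', colnames(vars), value = TRUE)
  out <- data.frame(gene = sort(unique(vars$gene)))
  for (i in ac_cols) out[, i] <- as.numeric(by(vars[, i], vars$gene, sum))
  for (i in an_cols) out[, i] <- as.numeric(by(vars[, i], vars$gene, mean))
  out
}
toy <- data.frame(gene = c('A', 'A', 'B'),
                  AC_nfe = c(1, 2, 5), AC_afr = c(0, 1, 1),
                  AN_nfe = c(100, 120, 90), AN_afr = c(40, 60, 30))
aggregate_genes(toy)
```

The three filtered variant sets would then differ only in the subsetting step that precedes the call.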
Now let’s take a look at what happens if we apply our model to ClinVar pathogenic variants with multiple submitters (i.e., confidently pathogenic ones). The proportion of positive predictions should be lower than for gnomAD variants!
clv_data = read.table('clv_path_parsed.tsv', sep='\t', header=T)
clv_data$rescue_conserved = clv_data$rescue_loeuf_min <= clv_data$min_LOEUF
prediction_forest_clv = predict(my_forest_b, clv_data, type="vote")
clv_data$prediction_prob_b = prediction_forest_clv[, 2]
clv_data$RF_class_25 = clv_data$prediction_prob_b > 0.25
clv_data$RF_class_50 = clv_data$prediction_prob_b > 0.5
clv_data$RF_class_75 = clv_data$prediction_prob_b > 0.75
clv_data$pext_class = clv_data$pext_avg < 0.1
clv_molten = melt(clv_data[, c('RF_class_25', 'RF_class_50', 'RF_class_75')],
id.vars=c())
clv_cutoff_stats = as.data.frame(as.matrix(aggregate(value~variable,
clv_molten,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
print(clv_cutoff_stats)
## variable value.1 value.2 value.3
## 1 RF_class_25 0.30314961 0.28638695 0.32045282
## 2 RF_class_50 0.12491052 0.11316396 0.13768708
## 3 RF_class_75 0.03937008 0.03276877 0.04723629
gn_cutoff_stats = as.data.frame(as.matrix(aggregate(value~variable,
snippet_molten,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
print(gn_cutoff_stats)
## variable value.1 value.2 value.3
## 1 RF_class_25 0.27980793 0.27722039 0.28241020
## 2 RF_class_50 0.10760171 0.10582353 0.10940611
## 3 RF_class_75 0.03901323 0.03790927 0.04014800
## 4 pext_class 0.04401596 0.04295438 0.04510253
The percentages do not differ much from gnomAD (and where they do, they differ in the opposite direction). Let’s explore this by \(s_{het}\) bin.
clv_data$s_main = sapply(as.character(clv_data$gene),
function(x) ifelse(x %in% as.character(cpl_head$gene),
as.numeric(cpl_head[x, 's_main']), NA))
clv_data$s_bin = ifelse(clv_data$s_main > 0.02, 'high',
ifelse(clv_data$s_main > 0.006, 'medium', 'low'))
clv_molten = melt(clv_data[, c('RF_class_25',
'RF_class_50',
'RF_class_75',
'pext_class', 's_bin')],
id.vars=c('s_bin'))
clv_all_cutoff_stats = as.data.frame(as.matrix(aggregate(value~s_bin+variable,
clv_molten,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
clv_all_cutoff_stats$violator = 'clinvar'
colnames(clv_all_cutoff_stats) = c('s_bin', 'classifier', 'PointEst',
'Lower', 'Upper', 'violator')
all_stats = rbind(cutoff_stats, clv_all_cutoff_stats)
all_stats$PointEst = as.numeric(as.character(all_stats$PointEst))
all_stats$Lower = as.numeric(as.character(all_stats$Lower))
all_stats$Upper = as.numeric(as.character(all_stats$Upper))
all_stats$s_bin = factor(all_stats$s_bin,
levels=c('low', 'medium', 'high'))
print(all_stats)
## s_bin violator classifier PointEst Lower Upper
## 1 high no RF_class_25 0.289789630 0.284557570 0.29507822
## 2 low no RF_class_25 0.252962420 0.246240750 0.25980433
## 3 medium no RF_class_25 0.280468250 0.276614850 0.28435423
## 4 high yes RF_class_25 0.929078010 0.893000140 0.95362302
## 5 low yes RF_class_25 0.258973060 0.251744120 0.26633570
## 6 medium yes RF_class_25 0.324284200 0.311070510 0.33778398
## 7 high no RF_class_50 0.120900280 0.117170790 0.12473169
## 8 low no RF_class_50 0.078005200 0.073923000 0.08229279
## 9 medium no RF_class_50 0.107558870 0.104919020 0.11025696
## 10 high yes RF_class_50 0.677304960 0.620672970 0.72917131
## 11 low yes RF_class_50 0.092366580 0.087656360 0.09730291
## 12 medium yes RF_class_50 0.137221630 0.127695240 0.14733868
## 13 high no RF_class_75 0.041828560 0.039568050 0.04421226
## 14 low no RF_class_75 0.025663770 0.023309660 0.02824876
## 15 medium no RF_class_75 0.039175540 0.037538160 0.04088131
## 16 high yes RF_class_75 0.397163120 0.341803930 0.45528638
## 17 low yes RF_class_75 0.034447890 0.031537030 0.03761700
## 18 medium yes RF_class_75 0.056839870 0.050584620 0.06381665
## 19 high no pext_class 0.040179490 0.038146330 0.04231624
## 20 low no pext_class 0.048681440 0.045780850 0.05175584
## 21 medium no pext_class 0.039498430 0.037996010 0.04105773
## 22 high yes pext_class 0.214659690 0.176433880 0.25856722
## 23 low yes pext_class 0.058965850 0.055597070 0.06252523
## 24 medium yes pext_class 0.041899440 0.037010450 0.04740247
## 25 high clinvar RF_class_25 0.308781870 0.281689830 0.33725616
## 26 low clinvar RF_class_25 0.165662651 0.129541800 0.20943201
## 27 medium clinvar RF_class_25 0.246753247 0.218905718 0.27688754
## 28 high clinvar RF_class_50 0.144475921 0.124585639 0.16693616
## 29 low clinvar RF_class_50 0.060240964 0.039331762 0.09121037
## 30 medium clinvar RF_class_50 0.113341204 0.093714710 0.13645915
## 31 high clinvar RF_class_75 0.037771483 0.027859680 0.05102458
## 32 low clinvar RF_class_75 0.006024096 0.001653582 0.02169511
## 33 medium clinvar RF_class_75 0.042502952 0.030856950 0.05828005
## 34 high clinvar pext_class 0.076233184 0.064084186 0.09046275
## 35 low clinvar pext_class 0.023017903 0.012156116 0.04316092
## 36 medium clinvar pext_class 0.020787746 0.013347906 0.03223890
ggplot(all_stats, aes(x=s_bin, y=PointEst, fill=s_bin)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5, width=0.4) +
theme_bw() + guides(fill=F) + xlab('Conservation level') +
ylab('% predicted positive') +
facet_grid(vars(violator), vars(classifier))
And one more plot:
ggplot(all_stats, aes(x=s_bin, y=PointEst, fill=violator)) +
geom_bar(col='black', stat='identity', position='dodge') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5,
width=0.4, position=position_dodge(width=1)) +
theme_bw() + xlab('Conservation level') +
ylab('% predicted positive') +
facet_wrap(~classifier, nrow=1) +
theme(axis.text.x=element_text(angle=90, hjust=1))
Bin-by-bin analysis suggests that in non-violator genes, ClinVar variants are predicted as lcLoF at rates similar to gnomAD variants; in violator genes, however, the proportion is substantially higher. The \(p_{ext}\)-based filter performs worse at discriminating violators from non-violators while showing similar positive prediction rates for gnomAD non-violators and ClinVar. The same pattern can be seen using pext with different cutoffs:
mean(var_data[var_data$violator == 'no', ]$pext_avg < 0.1, na.rm=T)
## [1] 0.04017949
mean(var_data[var_data$violator == 'yes', ]$pext_avg < 0.1, na.rm=T)
## [1] 0.2146597
mean(clv_data$pext_avg < 0.1, na.rm=T)
## [1] 0.05090595
mean(var_data[var_data$violator == 'no', ]$pext_avg < 0.25, na.rm=T)
## [1] 0.1114761
mean(var_data[var_data$violator == 'yes', ]$pext_avg < 0.25, na.rm=T)
## [1] 0.408377
mean(clv_data$pext_avg < 0.25, na.rm=T)
## [1] 0.1291343
mean(var_data[var_data$violator == 'no', ]$pext_avg < 0.5, na.rm=T)
## [1] 0.270581
mean(var_data[var_data$violator == 'yes', ]$pext_avg < 0.5, na.rm=T)
## [1] 0.6858639
mean(clv_data$pext_avg < 0.5, na.rm=T)
## [1] 0.2792637
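The repeated mean() calls above can be collapsed into a single pass over the cutoffs; a sketch on toy pext values (with the notebook’s data, var_data$pext_avg split by the violator column and clv_data$pext_avg would take their place):

```r
# Toy average-pext values for two groups of variants
pext_no  <- c(0.9, 0.8, 0.3, 0.05, 0.6, 0.45)
pext_yes <- c(0.05, 0.2, 0.4, 0.9, 0.1, 0.15)
cutoffs <- c(0.1, 0.25, 0.5)
# One column per cutoff, one row per group: fraction of variants below the cutoff
pos_rate <- sapply(cutoffs, function(k)
  c(no  = mean(pext_no  < k, na.rm = TRUE),
    yes = mean(pext_yes < k, na.rm = TRUE)))
colnames(pos_rate) <- cutoffs
pos_rate
```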
We can also investigate the discriminatory power of the model for separating ClinVar benign and pathogenic variants:
patho = read.table('clv_path_parsed.tsv', sep='\t', header=T)
patho$rescue_conserved = patho$rescue_loeuf_min < patho$min_LOEUF
benign = read.table('clv_benign_parsed.tsv', sep='\t', header=T)
benign$rescue_conserved = benign$rescue_loeuf_min < benign$min_LOEUF
p_pred = predict(my_forest_b, patho, type='vote')
patho$predict_prob = p_pred[, 2]
b_pred = predict(my_forest_b, benign, type='vote')
benign$predict_prob = b_pred[, 2]
patho$benign = 'no'
benign$benign = 'yes'
all_clv = as.data.frame(rbind(patho, benign))
pROC_obj <- roc(all_clv$benign, all_clv$pext_avg,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls > cases
sens.ci <- ci.se(pROC_obj)
plot(sens.ci, type="shape", col="lightblue")
pROC_obj <- roc(all_clv$benign, all_clv$predict_prob,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
sens.ci <- ci.se(pROC_obj)
plot(sens.ci, type="shape", col="lightblue")
Both our model and the \(p_{ext}\)-only one have low power to predict ClinVar pathogenic status. The RF model seems to perform slightly better, but the difference is tiny and not significant. Overall, the performance is rather poor.
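The eyeballed comparison of the two ROC curves can be made formal with pROC’s roc.test() (a DeLong test for paired curves). A sketch on synthetic labels and scores; with the notebook’s data, the two roc objects fitted above on pext_avg and predict_prob would be passed instead:

```r
library(pROC)

set.seed(42)
labels <- rep(c('no', 'yes'), each = 100)
# Two synthetic classifiers of slightly different strength
score1 <- rnorm(200, mean = rep(c(0, 0.5), each = 100))
score2 <- rnorm(200, mean = rep(c(0, 0.6), each = 100))
r1 <- roc(labels, score1, levels = c('no', 'yes'), direction = '<')
r2 <- roc(labels, score2, levels = c('no', 'yes'), direction = '<')
# DeLong test for the difference between two paired AUCs
tt <- roc.test(r1, r2, paired = TRUE, method = 'delong')
tt
```

A large p-value here would back up the “tiny and not significant” reading of the AUC gap.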
Let’s play around with the data and try to fit a model using all genes rather than just the subset with \(s_{het} > 0.02\).
all_vars$violator = as.factor(all_vars$violator)
train_ind = sample(1:nrow(all_vars), 0.909*nrow(all_vars))
trainset = all_vars[train_ind, ]
include = as.character(trainset$violator) == 'yes' | runif(nrow(trainset)) < 0.22
trainset_balanced = trainset[include, ]
testset = all_vars[-train_ind, ]
my_forest_c = randomForest(violator~ccr_exon_frac + is_mean +
is_max + rescue_conserved + pext_avg,
trainset_balanced, na.action = NULL)
prediction_all_data = predict(my_forest_c, testset, type="vote")
testset$pred_all_data = prediction_all_data[, 2]
pROC_obj <- roc(testset$violator, testset$pred_all_data,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
sens.ci <- ci.se(pROC_obj, conf.level = 0.99)
plot(sens.ci, type="shape", col="lightblue")
#plot(sens.ci, type="bars")
#dev.off()
violatoRF_c_res = data.frame(prob=pROC_obj$thresholds,
sens=pROC_obj$sensitivities,
spec=pROC_obj$specificities)
save(my_forest_c, file='classifier_c.RData')
Seems OK! Is it better than the pext-only model?
ad_pext = glm(violator~pext_avg, trainset_balanced, family='binomial')
testset$pred_ad_pext = predict(ad_pext, testset)
pROC_obj <- roc(testset$violator, testset$pred_ad_pext,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls > cases
sens.ci <- ci.se(pROC_obj, conf.level = 0.99)
plot(sens.ci, type="shape", col="lightblue")
Great! The \(p_{ext}\)-only model is essentially uninformative on the complete dataset (which the earlier model analysis plots already suggested). Now let’s move on and look into the cutoff statistics for the new model trained on all the data.
prediction_model_c = predict(my_forest_c, all_vars, type="vote")
all_vars$pred_all_data = prediction_model_c[, 2]
all_vars$RF_class_25_c = all_vars$pred_all_data > 0.25
all_vars$RF_class_50_c = all_vars$pred_all_data > 0.50
all_vars$RF_class_75_c = all_vars$pred_all_data > 0.75
snippet = all_vars[, c('violator', 's_bin', 'RF_class_25_c',
'RF_class_50_c', 'RF_class_75_c', 'pext_class')]
snippet_molten = na.omit(melt(snippet, id.vars=c('violator', 's_bin')))
cutoff_stats = as.data.frame(as.matrix(aggregate(value~s_bin+violator+variable,
snippet_molten,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
colnames(cutoff_stats) = c('s_bin', 'violator', 'classifier', 'PointEst',
'Lower', 'Upper')
cutoff_stats$PointEst = as.numeric(as.character(cutoff_stats$PointEst))
cutoff_stats$Lower = as.numeric(as.character(cutoff_stats$Lower))
cutoff_stats$Upper = as.numeric(as.character(cutoff_stats$Upper))
cutoff_stats$s_bin = factor(cutoff_stats$s_bin,
levels=c('low', 'medium', 'high'))
print(cutoff_stats)
## s_bin violator classifier PointEst Lower Upper
## 1 high no RF_class_25_c 0.337918723 0.332456020 0.34342501
## 2 low no RF_class_25_c 0.421012610 0.413329692 0.42873397
## 3 medium no RF_class_25_c 0.387911218 0.383722141 0.39211693
## 4 high yes RF_class_25_c 0.482269504 0.424580133 0.54043544
## 5 low yes RF_class_25_c 0.607351773 0.599189297 0.61545470
## 6 medium yes RF_class_25_c 0.532555673 0.518293478 0.54676486
## 7 high no RF_class_50_c 0.122440407 0.118690055 0.12629228
## 8 low no RF_class_50_c 0.191622838 0.185557564 0.19783821
## 9 medium no RF_class_50_c 0.151737593 0.148672909 0.15485396
## 10 high yes RF_class_50_c 0.198581560 0.156208183 0.24905654
## 11 low yes RF_class_50_c 0.329529862 0.321749065 0.33740522
## 12 medium yes RF_class_50_c 0.253446448 0.241234616 0.26605970
## 13 high no RF_class_75_c 0.009310791 0.008261199 0.01049233
## 14 low no RF_class_75_c 0.022432039 0.020235213 0.02486131
## 15 medium no RF_class_75_c 0.017231055 0.016145352 0.01838840
## 16 high yes RF_class_75_c 0.039007092 0.021918210 0.06848666
## 17 low yes RF_class_75_c 0.090127826 0.085471121 0.09501188
## 18 medium yes RF_class_75_c 0.044750795 0.039210624 0.05103217
## 19 high no pext_class 0.040179488 0.038146331 0.04231624
## 20 low no pext_class 0.048681440 0.045780851 0.05175584
## 21 medium no pext_class 0.039498433 0.037996007 0.04105773
## 22 high yes pext_class 0.214659686 0.176433881 0.25856722
## 23 low yes pext_class 0.058965847 0.055597070 0.06252523
## 24 medium yes pext_class 0.041899441 0.037010447 0.04740247
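A side note on the aggregate() idiom used above: Hmisc::binconf() returns a one-row matrix with columns PointEst, Lower and Upper, which is why each aggregated group yields three value columns that are then renamed. A minimal, self-contained illustration (toy numbers, not part of the pipeline):

```r
# Toy illustration: binconf() returns a one-row matrix with columns
# PointEst, Lower and Upper (Wilson interval by default).
library(Hmisc)
ci = binconf(30, 100)  # 30 successes out of 100 trials
print(ci)
```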
ggplot(cutoff_stats, aes(x=s_bin, y=PointEst, fill=s_bin)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5, width=0.4) +
theme_bw() + guides(fill=F) + xlab('Conservation level') +
ylab('% predicted positive') +
facet_grid(vars(violator), vars(classifier))
# No violator status
cutoff_stats_2 = as.data.frame(as.matrix(aggregate(value~s_bin+variable,
snippet_molten,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
colnames(cutoff_stats_2) = c('s_bin', 'classifier', 'PointEst',
'Lower', 'Upper')
cutoff_stats_2$PointEst = as.numeric(as.character(cutoff_stats_2$PointEst))
cutoff_stats_2$Lower = as.numeric(as.character(cutoff_stats_2$Lower))
cutoff_stats_2$Upper = as.numeric(as.character(cutoff_stats_2$Upper))
cutoff_stats_2$s_bin = factor(cutoff_stats_2$s_bin,
levels=c('low', 'medium', 'high'))
cutoff_stats_2$classifier = factor(cutoff_stats_2$classifier,
levels=c('RF_class_25_c', 'RF_class_50_c',
'RF_class_75_c', 'pext_class'))
print(cutoff_stats_2)
## s_bin classifier PointEst Lower Upper
## 1 high RF_class_25_c 0.339329659 0.333887871 0.34481423
## 2 low RF_class_25_c 0.508100446 0.502407174 0.51379162
## 3 medium RF_class_25_c 0.399985836 0.395952632 0.40403264
## 4 high RF_class_50_c 0.123184638 0.119442448 0.12702716
## 5 low RF_class_50_c 0.256075334 0.251137305 0.26107661
## 6 medium RF_class_50_c 0.160228037 0.157226036 0.16327625
## 7 high RF_class_75_c 0.009601054 0.008539317 0.01079336
## 8 low RF_class_75_c 0.054070474 0.051552629 0.05670394
## 9 medium RF_class_75_c 0.019528345 0.018419441 0.02070260
## 10 high pext_class 0.042112590 0.040043114 0.04428409
## 11 low pext_class 0.053527400 0.051301040 0.05584469
## 12 medium pext_class 0.039700882 0.038258433 0.04119539
ggplot(cutoff_stats_2, aes(x=s_bin, y=PointEst, fill=s_bin)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5, width=0.4) +
theme_bw() + guides(fill=F) + xlab('Conservation level') +
ylab('% predicted positive') +
facet_wrap(~classifier, nrow=1) + scale_y_continuous(limits=c(0, 1))
ggplot(cutoff_stats_2, aes(x=classifier, y=PointEst, fill=classifier)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5, width=0.4) +
theme_bw() + guides(fill=F) + xlab('Classifier') +
ylab('% predicted positive') +
facet_wrap(~s_bin, nrow=1) + scale_y_continuous(limits=c(0, 1)) +
theme(axis.text.x=element_text(angle=45, hjust=1))
The model behaves the opposite way, i.e. it annotates more variants in less conserved genes as low-confidence pLoF. What about ClinVar?
prediction_forest_ad_clv = predict(my_forest_c, clv_data, type="vote")
clv_data$prediction_prob_c = prediction_forest_ad_clv[, 2]
clv_data$RF_class_25_c = clv_data$prediction_prob_c > 0.25
clv_data$RF_class_50_c = clv_data$prediction_prob_c > 0.50
clv_data$RF_class_75_c = clv_data$prediction_prob_c > 0.75
clv_data$pext_class = clv_data$pext_avg < 0.1
clv_molten = melt(clv_data[, c('RF_class_25_c',
'RF_class_50_c',
'RF_class_75_c',
'pext_class')],
id.vars=c())
clv_cutoff_stats = as.data.frame(as.matrix(aggregate(value~variable,
clv_molten,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
print(clv_cutoff_stats)
## variable value.1 value.2 value.3
## 1 RF_class_25_c 0.330708661 0.313506759 0.348375440
## 2 RF_class_50_c 0.104867573 0.094044395 0.116775792
## 3 RF_class_75_c 0.013600573 0.009924934 0.018611872
## 4 pext_class 0.050905953 0.044082722 0.058720426
gn_cutoff_stats = as.data.frame(as.matrix(aggregate(value~variable,
snippet_molten,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
print(gn_cutoff_stats)
## variable value.1 value.2 value.3
## 1 RF_class_25_c 0.41262689 0.40978403 0.41547559
## 2 RF_class_50_c 0.17563348 0.17344476 0.17784387
## 3 RF_class_75_c 0.02593923 0.02503610 0.02687404
## 4 pext_class 0.04401596 0.04295438 0.04510253
OK, now the behavior seems reasonable (fewer ClinVar variants are annotated as low-confidence LoFs). Let’s do some data wrangling to create a complete plot with all groups of variants and all classifiers.
cutoff_stats_2 = as.matrix(cutoff_stats_2)
clv_cutoff_stats$variable =
levels(clv_cutoff_stats$variable)[as.numeric(clv_cutoff_stats$variable)]
clv_cutoff_stats$value.1 = as.numeric(as.character(clv_cutoff_stats$value.1))
clv_cutoff_stats$value.2 = as.numeric(as.character(clv_cutoff_stats$value.2))
clv_cutoff_stats$value.3 = as.numeric(as.character(clv_cutoff_stats$value.3))
for(i in 1:4){
cutoff_stats_2 = rbind(cutoff_stats_2, c('clinvar',
as.character(clv_cutoff_stats[i, ])))
}
cutoff_stats_2 = as.data.frame(cutoff_stats_2)
colnames(cutoff_stats_2) = c('s_bin', 'classifier', 'PointEst',
'Lower', 'Upper')
cutoff_stats_2$PointEst = as.numeric(as.character(cutoff_stats_2$PointEst))
cutoff_stats_2$Lower = as.numeric(as.character(cutoff_stats_2$Lower))
cutoff_stats_2$Upper = as.numeric(as.character(cutoff_stats_2$Upper))
cutoff_stats_2$s_bin = factor(cutoff_stats_2$s_bin,
levels=c('low', 'medium', 'high', 'clinvar'))
cutoff_stats_2$classifier = factor(cutoff_stats_2$classifier,
levels=c('RF_class_25_c', 'RF_class_50_c',
'RF_class_75_c', 'pext_class'))
ggplot(cutoff_stats_2, aes(x=s_bin, y=PointEst, fill=s_bin)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5, width=0.4) +
theme_bw() + guides(fill=F) + xlab('Conservation level') +
ylab('% predicted positive') +
facet_wrap(~classifier, nrow=1) + scale_y_continuous(limits=c(0, 1))
ggplot(cutoff_stats_2, aes(x=classifier, y=PointEst, fill=classifier)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5, width=0.4) +
theme_bw() + guides(fill=F) + xlab('Classifier') +
ylab('% predicted positive') +
facet_wrap(~s_bin, nrow=1) + scale_y_continuous(limits=c(0, 1)) +
theme(axis.text.x=element_text(angle=45, hjust=1))
Interestingly, we see that the proportion of predicted low-confidence pLoFs for ClinVar pathogenic variants is approximately the same as for gnomAD SNPs in the medium constraint bin. Could it be that the ClinVar variants just reside in genes that correspond to these strata? Let’s make another plot to explore this idea…
clv_data$s_main = sapply(as.character(clv_data$gene),
function(x) ifelse(x %in% as.character(cpl_head$gene),
as.numeric(cpl_head[x, 's_main']), NA))
clv_data$s_bin = ifelse(clv_data$s_main > 0.02, 'high',
ifelse(clv_data$s_main > 0.006, 'medium', 'low'))
clv_molten = melt(clv_data[, c('RF_class_25_c',
'RF_class_50_c',
'RF_class_75_c',
'pext_class', 's_bin')],
id.vars=c('s_bin'))
clv_all_cutoff_stats = as.data.frame(as.matrix(aggregate(value~s_bin+variable,
clv_molten,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
clv_all_cutoff_stats$violator = 'clinvar'
colnames(clv_all_cutoff_stats) = c('s_bin', 'classifier', 'PointEst',
'Lower', 'Upper', 'violator')
all_stats = rbind(cutoff_stats, clv_all_cutoff_stats)
all_stats$PointEst = as.numeric(as.character(all_stats$PointEst))
all_stats$Lower = as.numeric(as.character(all_stats$Lower))
all_stats$Upper = as.numeric(as.character(all_stats$Upper))
all_stats$s_bin = factor(all_stats$s_bin,
levels=c('low', 'medium', 'high'))
print(all_stats)
## s_bin violator classifier PointEst Lower Upper
## 1 high no RF_class_25_c 0.337918723 0.332456020 0.34342501
## 2 low no RF_class_25_c 0.421012610 0.413329692 0.42873397
## 3 medium no RF_class_25_c 0.387911218 0.383722141 0.39211693
## 4 high yes RF_class_25_c 0.482269504 0.424580133 0.54043544
## 5 low yes RF_class_25_c 0.607351773 0.599189297 0.61545470
## 6 medium yes RF_class_25_c 0.532555673 0.518293478 0.54676486
## 7 high no RF_class_50_c 0.122440407 0.118690055 0.12629228
## 8 low no RF_class_50_c 0.191622838 0.185557564 0.19783821
## 9 medium no RF_class_50_c 0.151737593 0.148672909 0.15485396
## 10 high yes RF_class_50_c 0.198581560 0.156208183 0.24905654
## 11 low yes RF_class_50_c 0.329529862 0.321749065 0.33740522
## 12 medium yes RF_class_50_c 0.253446448 0.241234616 0.26605970
## 13 high no RF_class_75_c 0.009310791 0.008261199 0.01049233
## 14 low no RF_class_75_c 0.022432039 0.020235213 0.02486131
## 15 medium no RF_class_75_c 0.017231055 0.016145352 0.01838840
## 16 high yes RF_class_75_c 0.039007092 0.021918210 0.06848666
## 17 low yes RF_class_75_c 0.090127826 0.085471121 0.09501188
## 18 medium yes RF_class_75_c 0.044750795 0.039210624 0.05103217
## 19 high no pext_class 0.040179488 0.038146331 0.04231624
## 20 low no pext_class 0.048681440 0.045780851 0.05175584
## 21 medium no pext_class 0.039498433 0.037996007 0.04105773
## 22 high yes pext_class 0.214659686 0.176433881 0.25856722
## 23 low yes pext_class 0.058965847 0.055597070 0.06252523
## 24 medium yes pext_class 0.041899441 0.037010447 0.04740247
## 25 high clinvar RF_class_25_c 0.258734655 0.233263617 0.28594972
## 26 low clinvar RF_class_25_c 0.439759036 0.387358139 0.49353804
## 27 medium clinvar RF_class_25_c 0.277449823 0.248352840 0.30855639
## 28 high clinvar RF_class_50_c 0.071765817 0.057719818 0.08890737
## 29 low clinvar RF_class_50_c 0.123493976 0.092351080 0.16325006
## 30 medium clinvar RF_class_50_c 0.082644628 0.065932034 0.10312585
## 31 high clinvar RF_class_75_c 0.007554297 0.003832761 0.01483555
## 32 low clinvar RF_class_75_c 0.036144578 0.020795029 0.06210557
## 33 medium clinvar RF_class_75_c 0.004722550 0.001837994 0.01207936
## 34 high clinvar pext_class 0.076233184 0.064084186 0.09046275
## 35 low clinvar pext_class 0.023017903 0.012156116 0.04316092
## 36 medium clinvar pext_class 0.020787746 0.013347906 0.03223890
ggplot(all_stats, aes(x=s_bin, y=PointEst, fill=s_bin)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5, width=0.4) +
theme_bw() + guides(fill=F) + xlab('Conservation level') +
ylab('% predicted positive') +
facet_grid(vars(violator), vars(classifier))
Unfortunately, they do. But now we can see that, within each bin, the percentage of ClinVar variants annotated as lcLoF is lower. Let’s draw one more plot to illustrate this:
ggplot(all_stats, aes(x=s_bin, y=PointEst, fill=violator)) +
geom_bar(col='black', stat='identity', position='dodge') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5,
width=0.4, position=position_dodge(width=1)) +
theme_bw() + xlab('Conservation level') +
ylab('% predicted positive') +
facet_wrap(~classifier, nrow=1) +
theme(axis.text.x=element_text(angle=90, hjust=1)) +
scale_y_continuous(limits=c(0, 1))
Classifying benign vs pathogenic in ClinVar with the updated model:
patho = read.table('clv_path_parsed.tsv', sep='\t', header=T)
patho$rescue_conserved = patho$rescue_loeuf_min < patho$min_LOEUF
benign = read.table('clv_benign_parsed.tsv', sep='\t', header=T)
benign$rescue_conserved = benign$rescue_loeuf_min < benign$min_LOEUF
p_pred = predict(my_forest_c, patho, type='vote')
patho$predict_prob = p_pred[, 2]
b_pred = predict(my_forest_c, benign, type='vote')
benign$predict_prob = b_pred[, 2]
patho$benign = 'no'
benign$benign = 'yes'
all_clv = as.data.frame(rbind(patho, benign))
all_clv_sub = all_clv[ifelse(as.character(all_clv$benign) == 'yes', TRUE,
runif(nrow(all_clv)) < 0.1), ]
pROC_obj <- roc(all_clv_sub$benign, all_clv_sub$pext_avg,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls > cases
sens.ci <- ci.se(pROC_obj)
plot(sens.ci, type="shape", col="lightblue")
pROC_obj <- roc(all_clv_sub$benign, all_clv_sub$predict_prob,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
sens.ci <- ci.se(pROC_obj)
plot(sens.ci, type="shape", col="lightblue")
The model seems to predict the benign status of ClinVar variants quite efficiently.
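The visual comparison of the two ROC curves above can also be backed by a formal test of the AUC difference. A minimal sketch using pROC’s DeLong test, with synthetic scores standing in for `all_clv_sub$pext_avg` (weak) and `all_clv_sub$predict_prob` (stronger); on the real data one would pass those columns together with `all_clv_sub$benign`:

```r
# Sketch: compare two correlated ROC curves with DeLong's test (pROC::roc.test).
# Synthetic scores stand in for the pext-only and RF prediction columns.
library(pROC)
set.seed(1)
labels = rep(c('no', 'yes'), each = 200)
score_weak   = rnorm(400, mean = ifelse(labels == 'yes', 0.6, 0.4), sd = 0.2)
score_strong = rnorm(400, mean = ifelse(labels == 'yes', 0.8, 0.2), sd = 0.2)
roc_weak   = roc(labels, score_weak,   levels = c('no', 'yes'), direction = '<')
roc_strong = roc(labels, score_strong, levels = c('no', 'yes'), direction = '<')
# Paired test: both scores are computed on the same observations
roc.test(roc_weak, roc_strong, method = 'delong', paired = TRUE)
```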
Now let’s see whether removing variants flagged by the new predictor performs better than random exclusion (this piece of code is copied from above):
vars_non_true = all_vars[all_vars$RF_class_25_c %in% c(FALSE, NA), ]
gene_level = aggregate(AC_main ~ gene, vars_non_true, sum)
gene_level$source_ac = sapply(as.character(gene_level$gene), function(x) cpl_head[x, 'AC_main'])
ggplot(gene_level, aes(source_ac, AC_main)) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() + scale_fill_gradientn(colours = mypal)
What happens to the imbalance of PTV counts?
cov_data = read.table('./covered_genes.tsv', header=T, sep='\t')
vars_non_true = vars_non_true[as.character(vars_non_true$gene) %in% as.character(cov_data$name), ]
vars_non_true$gene = droplevels(vars_non_true$gene)
ac_cols = grepl('AC_', colnames(vars_non_true))
an_cols = grepl('AN_', colnames(vars_non_true))
gene_data = data.frame(gene = sort(unique(vars_non_true$gene)))
for (i in colnames(vars_non_true)[ac_cols]){
gene_data[, i] = as.numeric(by(vars_non_true[, i], vars_non_true$gene, sum))
}
gene_data$AC_main = gene_data$AC_nfe + gene_data$AC_afr + gene_data$AC_amr +
gene_data$AC_eas + gene_data$AC_sas
for (i in colnames(vars_non_true)[an_cols]){
gene_data[, i] = as.numeric(by(vars_non_true[, i], vars_non_true$gene, mean))
}
gene_data$AN_main = gene_data$AN_nfe + gene_data$AN_afr + gene_data$AN_amr +
gene_data$AN_eas + gene_data$AN_sas
# Add covered genes with no pLoF
dummy_row = c(rep(0, 8), rep(colMeans(gene_data[, 10:17])))
zero_genes = unique(as.character(cov_data$name)[!(as.character(cov_data$name)
%in% as.character(gene_data$gene))])
remainder = as.data.frame(t(sapply(zero_genes, function(x) c(x, dummy_row))))
rownames(remainder) = c()
colnames(remainder) = colnames(gene_data)
flt_gene_data = as.data.frame(rbind(gene_data, remainder))
num_cols = colnames(flt_gene_data)[grepl('A[CN]_', colnames(flt_gene_data))]
for (col in num_cols){
flt_gene_data[, col] = as.numeric(as.character(flt_gene_data[, col]))
}
flt_gene_data$gene = as.factor(as.character(flt_gene_data$gene))
flt_gene_data =
flt_gene_data[flt_gene_data$AC_main/flt_gene_data$AN_main <= 0.001, ]
# Cassa comparison
cassa = read.table('cassa_table.csv', header=T, sep='\t')
rownames(cassa) = cassa$gene_symbol
flt_gene_data$U = sapply(flt_gene_data$gene,
function(x) cassa[as.character(x), 'U'])
flt_gene_data = flt_gene_data[!(is.na(flt_gene_data$U)), ]
flt_gene_data[is.na(flt_gene_data)] = 1
perc = ecdf(flt_gene_data$U)
flt_gene_data$tercile = floor(perc(flt_gene_data$U) * 3)
flt_gene_data$tercile = ifelse(flt_gene_data$tercile < 3,
flt_gene_data$tercile + 1,
flt_gene_data$tercile)
table(flt_gene_data$tercile)
##
## 1 2 3
## 5323 5329 5342
flt_gene_data$shet_cassa = sapply(flt_gene_data$gene,
function(x) cassa[as.character(x), 's_het'])
table(is.na(flt_gene_data$shet_cassa))
##
## FALSE
## 15994
rownames(flt_gene_data) = flt_gene_data$gene
inheritance = read.table('genes_inheritance.txt', header=F, sep='\t',
stringsAsFactors = F, row.names=1)
flt_gene_data$disease = sapply(rownames(flt_gene_data),
function(x) ifelse(x%in% rownames(inheritance),
inheritance[x, 'V2'], 'none'))
flt_gene_data$dbin = sapply(flt_gene_data$disease,
function(x) ifelse(x == "none", 0, 1))
flt_gene_data$whole = apply(flt_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:8]),
as.numeric(elt[10:16])),
byrow=T, nrow=2))$p.value)
flt_gene_data$no_fin_asj = apply(flt_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6]),
as.numeric(elt[10:14])),
byrow=T, nrow=2))$p.value)
for (i in 2:6) {
flt_gene_data[, ncol(flt_gene_data) + 1] = apply(flt_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6][c(2:6) != i]),
as.numeric(elt[10:14][c(10:14) != 8+i])),
byrow=T, nrow=2))$p.value)
}
colnames(flt_gene_data) = c(colnames(flt_gene_data)[1:24], 'no_afr',
'no_sas', 'no_amr', 'no_eas', 'no_nfe')
flt_gene_data$s_main = sapply(as.character(flt_gene_data$gene),
function(x) as.numeric(cpl_head[as.character(cpl_head$gene) == x,
's_main'])[1])
#flt_gene_data = flt_gene_data[flt_gene_data$s_main >= 0.02, ]
tpl = melt(flt_gene_data[, c(1, 21:29)], id.vars=c('gene', 'disease', 'dbin'))
tpl = na.omit(tpl[tpl$value < 1e-5, ])
sum(tpl$variable == 'whole')
## [1] 1775
sum(tpl$variable == 'whole' & tpl$dbin == 1)
## [1] 336
sum(tpl$variable == 'no_fin_asj')
## [1] 1252
sum(tpl$variable == 'no_fin_asj' & tpl$dbin == 1)
## [1] 212
ggplot(tpl, aes(x=variable, fill=as.factor(dbin))) + geom_bar() +
theme_bw() + coord_flip()
flt_gene_data$no_fin_asj = sapply(flt_gene_data$no_fin_asj,
function(x) ifelse(is.na(x), 1, x))
flt_gene_data$no_fin_asj_prior = sapply(as.character(flt_gene_data$gene),
function(x) as.numeric(cpl_head[as.character(cpl_head$gene) == x, 'no_fin_asj'])[1])
ggplot(flt_gene_data, aes(x=-log10(no_fin_asj_prior), y=-log10(no_fin_asj))) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() +
xlab('chi-squared p-value pre-filter') +
ylab('chi-squared p-value post-filter') +
scale_fill_gradientn(colours = mypal)
table(flt_gene_data$no_fin_asj > 1e-5 & flt_gene_data$no_fin_asj_prior < 1e-5)
##
## FALSE TRUE
## 12820 758
binconf(sum(flt_gene_data$no_fin_asj >= 1e-5 &
flt_gene_data$no_fin_asj_prior < 1e-5, na.rm=T),
sum(flt_gene_data$no_fin_asj_prior < 1e-5, na.rm=T))
## PointEst Lower Upper
## 0.4073079 0.3851981 0.4297996
Now compare the percentage of genes with no imbalance after filtering with the RF model to the percentage after random exclusion of variants:
vars_random = all_vars[sample(1:nrow(all_vars), nrow(vars_non_true)), ]
cov_data = read.table('./covered_genes.tsv', header=T, sep='\t')
vars_random = vars_random[as.character(vars_random$gene) %in% as.character(cov_data$name), ]
vars_random$gene = droplevels(vars_random$gene)
ac_cols = grepl('AC_', colnames(vars_random))
an_cols = grepl('AN_', colnames(vars_random))
gene_data = data.frame(gene = sort(unique(vars_random$gene)))
for (i in colnames(vars_random)[ac_cols]){
gene_data[, i] = as.numeric(by(vars_random[, i], vars_random$gene, sum))
}
gene_data$AC_main = gene_data$AC_nfe + gene_data$AC_afr + gene_data$AC_amr +
gene_data$AC_eas + gene_data$AC_sas
for (i in colnames(vars_random)[an_cols]){
gene_data[, i] = as.numeric(by(vars_random[, i], vars_random$gene, mean))
}
gene_data$AN_main = gene_data$AN_nfe + gene_data$AN_afr + gene_data$AN_amr +
gene_data$AN_eas + gene_data$AN_sas
# Add covered genes with no pLoF
dummy_row = c(rep(0, 8), rep(colMeans(gene_data[, 10:17])))
zero_genes = unique(as.character(cov_data$name)[!(as.character(cov_data$name)
%in% as.character(gene_data$gene))])
remainder = as.data.frame(t(sapply(zero_genes, function(x) c(x, dummy_row))))
rownames(remainder) = c()
colnames(remainder) = colnames(gene_data)
random_gene_data = as.data.frame(rbind(gene_data, remainder))
num_cols = colnames(random_gene_data)[grepl('A[CN]_', colnames(random_gene_data))]
for (col in num_cols){
random_gene_data[, col] = as.numeric(as.character(random_gene_data[, col]))
}
random_gene_data$gene = as.factor(as.character(random_gene_data$gene))
random_gene_data =
random_gene_data[random_gene_data$AC_main/random_gene_data$AN_main <= 0.001, ]
# Cassa comparison
cassa = read.table('cassa_table.csv', header=T, sep='\t')
rownames(cassa) = cassa$gene_symbol
random_gene_data$U = sapply(random_gene_data$gene,
function(x) cassa[as.character(x), 'U'])
random_gene_data = random_gene_data[!(is.na(random_gene_data$U)), ]
random_gene_data[is.na(random_gene_data)] = 1
perc = ecdf(random_gene_data$U)
random_gene_data$tercile = floor(perc(random_gene_data$U) * 3)
random_gene_data$tercile = ifelse(random_gene_data$tercile < 3,
random_gene_data$tercile + 1,
random_gene_data$tercile)
table(random_gene_data$tercile)
##
## 1 2 3
## 5323 5329 5341
random_gene_data$shet_cassa = sapply(random_gene_data$gene,
function(x) cassa[as.character(x), 's_het'])
table(is.na(random_gene_data$shet_cassa))
##
## FALSE
## 15993
rownames(random_gene_data) = random_gene_data$gene
inheritance = read.table('genes_inheritance.txt', header=F, sep='\t',
stringsAsFactors = F, row.names=1)
random_gene_data$disease = sapply(rownames(random_gene_data),
function(x) ifelse(x%in% rownames(inheritance),
inheritance[x, 'V2'], 'none'))
random_gene_data$dbin = sapply(random_gene_data$disease,
function(x) ifelse(x == "none", 0, 1))
random_gene_data$whole = apply(random_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:8]),
as.numeric(elt[10:16])),
byrow=T, nrow=2))$p.value)
random_gene_data$no_fin_asj = apply(random_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6]),
as.numeric(elt[10:14])),
byrow=T, nrow=2))$p.value)
for (i in 2:6) {
random_gene_data[, ncol(random_gene_data) + 1] = apply(random_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6][c(2:6) != i]),
as.numeric(elt[10:14][c(10:14) != 8+i])),
byrow=T, nrow=2))$p.value)
}
colnames(random_gene_data) = c(colnames(random_gene_data)[1:24], 'no_afr',
'no_sas', 'no_amr', 'no_eas', 'no_nfe')
random_gene_data$s_main = sapply(as.character(random_gene_data$gene),
function(x) as.numeric(cpl_head[as.character(cpl_head$gene) == x,
's_main'])[1])
#random_gene_data = random_gene_data[random_gene_data$s_main >= 0.02, ]
tpl = melt(random_gene_data[, c(1, 21:29)], id.vars=c('gene', 'disease', 'dbin'))
tpl = na.omit(tpl[tpl$value < 1e-5, ])
sum(tpl$variable == 'whole')
## [1] 2033
sum(tpl$variable == 'whole' & tpl$dbin == 1)
## [1] 374
sum(tpl$variable == 'no_fin_asj')
## [1] 1477
sum(tpl$variable == 'no_fin_asj' & tpl$dbin == 1)
## [1] 240
ggplot(tpl, aes(x=variable, fill=as.factor(dbin))) + geom_bar() +
theme_bw() + coord_flip()
random_gene_data$no_fin_asj = sapply(random_gene_data$no_fin_asj,
function(x) ifelse(is.na(x), 1, x))
random_gene_data$no_fin_asj_prior = sapply(as.character(random_gene_data$gene),
function(x) as.numeric(cpl_head[as.character(cpl_head$gene) == x, 'no_fin_asj'])[1])
ggplot(random_gene_data, aes(x=-log10(no_fin_asj_prior), y=-log10(no_fin_asj))) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() +
xlab('chi-squared p-value pre-filter') +
ylab('chi-squared p-value post-filter') +
scale_fill_gradientn(colours = mypal)
table(random_gene_data$no_fin_asj > 1e-5 & random_gene_data$no_fin_asj_prior < 1e-5)
##
## FALSE TRUE
## 12988 589
binconf(sum(random_gene_data$no_fin_asj >= 1e-5 &
random_gene_data$no_fin_asj_prior < 1e-5, na.rm=T),
sum(random_gene_data$no_fin_asj_prior < 1e-5, na.rm=T))
## PointEst Lower Upper
## 0.3166667 0.2959227 0.3381663
And the same with a \(p_{ext}\)-based filter matched by the number of removed variants:
vars_pext_flt = all_vars[!(all_vars$pext_avg < 0.6), ]
cov_data = read.table('./covered_genes.tsv', header=T, sep='\t')
vars_pext_flt = vars_pext_flt[as.character(vars_pext_flt$gene) %in% as.character(cov_data$name), ]
vars_pext_flt$gene = droplevels(vars_pext_flt$gene)
ac_cols = grepl('AC_', colnames(vars_pext_flt))
an_cols = grepl('AN_', colnames(vars_pext_flt))
gene_data = data.frame(gene = sort(unique(vars_pext_flt$gene)))
for (i in colnames(vars_pext_flt)[ac_cols]){
gene_data[, i] = as.numeric(by(vars_pext_flt[, i], vars_pext_flt$gene, sum))
}
gene_data$AC_main = gene_data$AC_nfe + gene_data$AC_afr + gene_data$AC_amr +
gene_data$AC_eas + gene_data$AC_sas
for (i in colnames(vars_pext_flt)[an_cols]){
gene_data[, i] = as.numeric(by(vars_pext_flt[, i], vars_pext_flt$gene, mean))
}
gene_data$AN_main = gene_data$AN_nfe + gene_data$AN_afr + gene_data$AN_amr +
gene_data$AN_eas + gene_data$AN_sas
# Add covered genes with no pLoF
dummy_row = c(rep(0, 8), rep(colMeans(gene_data[, 10:17])))
zero_genes = unique(as.character(cov_data$name)[!(as.character(cov_data$name)
%in% as.character(gene_data$gene))])
remainder = as.data.frame(t(sapply(zero_genes, function(x) c(x, dummy_row))))
rownames(remainder) = c()
colnames(remainder) = colnames(gene_data)
pext_gene_data = as.data.frame(rbind(gene_data, remainder))
num_cols = colnames(pext_gene_data)[grepl('A[CN]_', colnames(pext_gene_data))]
for (col in num_cols){
pext_gene_data[, col] = as.numeric(as.character(pext_gene_data[, col]))
}
pext_gene_data$gene = as.factor(as.character(pext_gene_data$gene))
pext_gene_data =
pext_gene_data[pext_gene_data$AC_main/pext_gene_data$AN_main <= 0.001, ]
# Cassa comparison
cassa = read.table('cassa_table.csv', header=T, sep='\t')
rownames(cassa) = cassa$gene_symbol
pext_gene_data$U = sapply(pext_gene_data$gene,
function(x) cassa[as.character(x), 'U'])
pext_gene_data = pext_gene_data[!(is.na(pext_gene_data$U)), ]
pext_gene_data[is.na(pext_gene_data)] = 1
perc = ecdf(pext_gene_data$U)
pext_gene_data$tercile = floor(perc(pext_gene_data$U) * 3)
pext_gene_data$tercile = ifelse(pext_gene_data$tercile < 3,
pext_gene_data$tercile + 1,
pext_gene_data$tercile)
table(pext_gene_data$tercile)
##
## 1 2 3
## 5323 5329 5342
pext_gene_data$shet_cassa = sapply(pext_gene_data$gene,
function(x) cassa[as.character(x), 's_het'])
table(is.na(pext_gene_data$shet_cassa))
##
## FALSE
## 15994
rownames(pext_gene_data) = pext_gene_data$gene
inheritance = read.table('genes_inheritance.txt', header=F, sep='\t',
stringsAsFactors = F, row.names=1)
pext_gene_data$disease = sapply(rownames(pext_gene_data),
function(x) ifelse(x%in% rownames(inheritance),
inheritance[x, 'V2'], 'none'))
pext_gene_data$dbin = sapply(pext_gene_data$disease,
function(x) ifelse(x == "none", 0, 1))
pext_gene_data$whole = apply(pext_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:8]),
as.numeric(elt[10:16])),
byrow=T, nrow=2))$p.value)
pext_gene_data$no_fin_asj = apply(pext_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6]),
as.numeric(elt[10:14])),
byrow=T, nrow=2))$p.value)
for (i in 2:6) {
pext_gene_data[, ncol(pext_gene_data) + 1] = apply(pext_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6][c(2:6) != i]),
as.numeric(elt[10:14][c(10:14) != 8+i])),
byrow=T, nrow=2))$p.value)
}
colnames(pext_gene_data) = c(colnames(pext_gene_data)[1:24], 'no_afr',
'no_sas', 'no_amr', 'no_eas', 'no_nfe')
pext_gene_data$s_main = sapply(as.character(pext_gene_data$gene),
function(x) as.numeric(cpl_head[as.character(cpl_head$gene) == x,
's_main'])[1])
#pext_gene_data = pext_gene_data[pext_gene_data$s_main >= 0.02, ]
tpl = melt(pext_gene_data[, c(1, 21:29)], id.vars=c('gene', 'disease', 'dbin'))
tpl = na.omit(tpl[tpl$value < 1e-5, ])
sum(tpl$variable == 'whole')
## [1] 1774
sum(tpl$variable == 'whole' & tpl$dbin == 1)
## [1] 326
sum(tpl$variable == 'no_fin_asj')
## [1] 1286
sum(tpl$variable == 'no_fin_asj' & tpl$dbin == 1)
## [1] 202
ggplot(tpl, aes(x=variable, fill=as.factor(dbin))) + geom_bar() +
theme_bw() + coord_flip()
pext_gene_data$no_fin_asj = sapply(pext_gene_data$no_fin_asj,
function(x) ifelse(is.na(x), 1, x))
pext_gene_data$no_fin_asj_prior = sapply(as.character(pext_gene_data$gene),
function(x) as.numeric(cpl_head[as.character(cpl_head$gene) == x, 'no_fin_asj'])[1])
ggplot(pext_gene_data, aes(x=-log10(no_fin_asj_prior), y=-log10(no_fin_asj))) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() +
xlab('chi-squared p-value pre-filter') +
ylab('chi-squared p-value post-filter') +
scale_fill_gradientn(colours = mypal)
table(pext_gene_data$no_fin_asj > 1e-4 & pext_gene_data$no_fin_asj_prior < 1e-5)
##
## FALSE TRUE
## 12970 608
binconf(sum(pext_gene_data$no_fin_asj >= 1e-5 &
pext_gene_data$no_fin_asj_prior < 1e-5, na.rm=T),
sum(pext_gene_data$no_fin_asj_prior < 1e-5, na.rm=T))
## PointEst Lower Upper
## 0.3353036 0.3142133 0.3570724
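As a rough significance check, the rescue fractions reported above for the RF-based filter and for random exclusion can be compared with a two-sample proportion test. The counts 758/1861 and 589/1860 are reconstructed from the printed binconf() tables and point estimates above, so treat them as approximate:

```r
# Two-sample test of "rescued gene" proportions: RF-based filter vs. random
# variant exclusion. Counts reconstructed from the binconf() outputs above.
pt = prop.test(x = c(758, 589), n = c(1861, 1860))
print(pt)
```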
write.table(all_vars, file='gnomad_ptv_annotated.tsv', sep='\t',
quote=F, row.names=F)
Let’s see whether we can aggregate the results of the two predictors to yield something better. :)
all_vars$geomean = sqrt(all_vars$prediction_prob_b * all_vars$pred_all_data)
pROC_obj <- roc(all_vars$violator, all_vars$geomean,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
sens.ci <- ci.se(pROC_obj, conf.level = 0.99)
plot(sens.ci, type="shape", col="lightblue")
#plot(sens.ci, type="bars")
#dev.off()
all_vars$max_prob = ifelse(all_vars$prediction_prob_b > all_vars$pred_all_data,
all_vars$prediction_prob_b, all_vars$pred_all_data)
pROC_obj <- roc(all_vars$violator, all_vars$max_prob,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
sens.ci <- ci.se(pROC_obj, conf.level = 0.99)
plot(sens.ci, type="shape", col="lightblue")
#plot(sens.ci, type="bars")
#dev.off()
all_vars$binwise_prob = ifelse(all_vars$s_bin == 'high',
all_vars$prediction_prob_b,
all_vars$pred_all_data)
pROC_obj <- roc(all_vars$violator, all_vars$binwise_prob,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
sens.ci <- ci.se(pROC_obj, conf.level = 0.99)
plot(sens.ci, type="shape", col="lightblue")
#plot(sens.ci, type="bars")
#dev.off()
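The three aggregation schemes tried above (geometric mean, element-wise maximum, and bin-wise model choice) each reduce to one line; a toy sketch where p_a and p_b are made-up stand-ins for the two models' vote probabilities:

```r
# Made-up vote probabilities from two hypothetical models
p_a <- c(0.9, 0.2, 0.6, 0.1)
p_b <- c(0.7, 0.4, 0.3, 0.2)
s_bin <- c('high', 'low', 'high', 'medium')

geomean <- sqrt(p_a * p_b)                    # geometric mean of the scores
max_p <- pmax(p_a, p_b)                       # element-wise maximum
binwise <- ifelse(s_bin == 'high', p_a, p_b)  # model "a" only in the high-s_het bin
```

pmax() is the idiomatic base-R equivalent of the ifelse(a > b, a, b) construction used above.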
While simply combining the prediction results is not effective, bin-wise choice of the prediction model enhances the power of the analysis. Will it also improve ClinVar dissection?
all_clv$s_main = sapply(as.character(all_clv$gene),
function(x) ifelse(x %in% rownames(cpl_head),
cpl_head[x, 's_main'],
0.05))
all_clv$s_bin = ifelse(all_clv$s_main > 0.02, 'high',
ifelse(all_clv$s_main > 0.006, 'medium', 'low'))
all_clv$model_a = predict(my_forest_b, all_clv, type='vote')[, 2]
all_clv$model_b = predict(my_forest_c, all_clv, type='vote')[, 2]
all_clv$binwise = ifelse(all_clv$s_bin == 'high',
all_clv$model_a, all_clv$model_b)
all_clv_sub = all_clv[ifelse(as.character(all_clv$benign) == 'yes', TRUE,
runif(nrow(all_clv)) < 0.1), ]
pROC_obj <- roc(all_clv_sub$benign, all_clv_sub$pext_avg,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls > cases
sens.ci <- ci.se(pROC_obj)
plot(sens.ci, type="shape", col="lightblue")
pROC_obj <- roc(all_clv_sub$benign, all_clv_sub$binwise,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
sens.ci <- ci.se(pROC_obj)
plot(sens.ci, type="shape", col="lightblue")
Hmm, nope: combining the models makes ClinVar dissection worse. Nevertheless, the “b” model performs well in all tests despite its low initial AUC. :)
Let’s create a big masterplot!
all_vars$pext_class_50 = all_vars$pext_avg < 0.5
all_vars$pext_class_25 = all_vars$pext_avg < 0.25
all_vars$pext_class_10 = all_vars$pext_avg < 0.1
snippet_gn = all_vars[, c('violator', 's_bin', 'RF_class_25',
'RF_class_50', 'RF_class_75',
'RF_class_25_c', 'RF_class_50_c',
'RF_class_75_c', 'pext_class_50',
'pext_class_25', 'pext_class_10')]
all_clv$pext_class_50 = all_clv$pext_avg < 0.5
all_clv$pext_class_25 = all_clv$pext_avg < 0.25
all_clv$pext_class_10 = all_clv$pext_avg < 0.1
all_clv$RF_class_25 = all_clv$model_a > 0.25
all_clv$RF_class_50 = all_clv$model_a > 0.5
all_clv$RF_class_75 = all_clv$model_a > 0.75
all_clv$RF_class_25_c = all_clv$model_b > 0.25
all_clv$RF_class_50_c = all_clv$model_b > 0.5
all_clv$RF_class_75_c = all_clv$model_b > 0.75
all_clv$violator = ifelse(all_clv$benign == 'yes',
'clv_benign', 'clv_patho')
snippet_clv = all_clv[, c('violator', 's_bin', 'RF_class_25',
'RF_class_50', 'RF_class_75',
'RF_class_25_c', 'RF_class_50_c',
'RF_class_75_c', 'pext_class_50',
'pext_class_25', 'pext_class_10')]
total = as.data.frame(rbind(snippet_gn, snippet_clv))
total_mlt = melt(total, id.vars=c('violator', 's_bin'))
total_mlt$model = ifelse(total_mlt$variable %in% c('RF_class_25',
'RF_class_50',
'RF_class_75'),
'model_a',
ifelse(total_mlt$variable %in% c('RF_class_25_c',
'RF_class_50_c',
'RF_class_75_c'),
'model_b', 'pext'))
total_mlt$cutoff = ifelse(total_mlt$variable %in% c('RF_class_25',
'RF_class_25_c',
'pext_class_50'),
'loose',
ifelse(total_mlt$variable %in% c('RF_class_50',
'RF_class_50_c',
'pext_class_25'),
'moderate', 'strict'))
colnames(total_mlt) = c('violator', 's_bin', 'classifier',
'class', 'model', 'cutoff')
ttl_stats = as.data.frame(as.matrix(aggregate(class~violator+s_bin+model+cutoff,
total_mlt,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
colnames(ttl_stats) = c('violator', 's_bin', 'model', 'cutoff',
'PointEst', 'Lower', 'Upper')
ttl_stats$PointEst = as.numeric(as.character(ttl_stats$PointEst))
ttl_stats$Lower = as.numeric(as.character(ttl_stats$Lower))
ttl_stats$Upper = as.numeric(as.character(ttl_stats$Upper))
ttl_stats$s_bin = factor(ttl_stats$s_bin,
levels=c('low', 'medium', 'high'))
ggplot(ttl_stats, aes(x=s_bin, y=PointEst, fill=violator)) +
geom_bar(col='black', stat='identity', position='dodge') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5,
width=0.4, position=position_dodge(width=1)) +
theme_bw() + xlab('Conservation level') +
ylab('% predicted positive') +
facet_grid(vars(model), vars(cutoff)) +
theme(axis.text.x=element_text(angle=90, hjust=1)) +
scale_y_continuous(limits=c(0, 1))
Well, we’ve shown that the model trained to dissect violator and non-violator genes efficiently predicts the ClinVar benign status of a variant. But can we obtain a better predictor by fitting a model directly to ClinVar data? Let’s examine this possibility.
all_clv$benign = factor(all_clv$benign, levels=c('no','yes'))
train_ind = sample(1:nrow(all_clv), 0.7*nrow(all_clv))
trainset = all_clv[train_ind, ]
include = as.character(trainset$benign) == 'yes' | runif(nrow(trainset)) < 0.1
trainset_balanced = trainset[include, ]
testset = all_clv[-train_ind, ]
my_forest_d = randomForest(benign~ccr_exon_frac + is_mean +
is_max + rescue_conserved + pext_avg,
trainset_balanced, na.action = NULL)
prediction_clv = predict(my_forest_d, testset, type="vote")
testset$prediction_d = prediction_clv[, 2]
pROC_obj <- roc(testset$benign, testset$prediction_d,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
sens.ci <- ci.se(pROC_obj, conf.level = 0.99)
plot(sens.ci, type="shape", col="lightblue")
#plot(sens.ci, type="bars")
#dev.off()
violatoRF_d_res = data.frame(prob=pROC_obj$thresholds,
sens=pROC_obj$sensitivities,
spec=pROC_obj$specificities)
save(my_forest_d, file='classifier_d.RData')
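The class-balancing step above keeps every 'yes' (benign) variant but only a ~10% random sample of the majority class before training. The same idiom on a made-up label vector:

```r
set.seed(42)
# Heavily imbalanced toy labels: ~5% 'yes'
labels <- factor(sample(c('no', 'yes'), 1000, replace = TRUE, prob = c(0.95, 0.05)),
                 levels = c('no', 'yes'))

# Keep all minority-class rows and ~10% of the majority class
include <- labels == 'yes' | runif(length(labels)) < 0.1
table(labels[include])
```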
The model seems to work great! Let’s apply it to all gnomAD variants to see how it goes there.
all_vars$prediction_d = predict(my_forest_d, all_vars, type='vote')[, 2]
pROC_obj <- roc(all_vars$violator, all_vars$prediction_d,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
sens.ci <- ci.se(pROC_obj, conf.level = 0.99)
plot(sens.ci, type="shape", col="lightblue")
The performance falls flat here. Let’s stratify the results to see whether any differences in performance across \(s_{het}\) bins can be seen.
all_vars$RF_class_25_d = all_vars$prediction_d > 0.25
all_vars$RF_class_50_d = all_vars$prediction_d > 0.5
all_vars$RF_class_75_d = all_vars$prediction_d > 0.75
snippet = all_vars[, c('violator', 's_bin', 'RF_class_25_d',
'RF_class_50_d', 'RF_class_75_d', 'pext_class')]
snippet_molten = na.omit(melt(snippet, id.vars=c('violator', 's_bin')))
cutoff_stats = as.data.frame(as.matrix(aggregate(value~s_bin+violator+variable,
snippet_molten,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
colnames(cutoff_stats) = c('s_bin', 'violator', 'classifier', 'PointEst',
'Lower', 'Upper')
cutoff_stats$PointEst = as.numeric(as.character(cutoff_stats$PointEst))
cutoff_stats$Lower = as.numeric(as.character(cutoff_stats$Lower))
cutoff_stats$Upper = as.numeric(as.character(cutoff_stats$Upper))
cutoff_stats$s_bin = factor(cutoff_stats$s_bin,
levels=c('low', 'medium', 'high'))
print(cutoff_stats)
## s_bin violator classifier PointEst Lower Upper
## 1 high no RF_class_25_d 0.46025412 0.45448030 0.46603863
## 2 low no RF_class_25_d 0.53672137 0.52893344 0.54449144
## 3 medium no RF_class_25_d 0.50460718 0.50030002 0.50891366
## 4 high yes RF_class_25_d 0.56737589 0.50902832 0.62391250
## 5 low yes RF_class_25_d 0.54972196 0.54142260 0.55799374
## 6 medium yes RF_class_25_d 0.46468717 0.45048557 0.47894626
## 7 high no RF_class_50_d 0.20595751 0.20130785 0.21068623
## 8 low no RF_class_50_d 0.26753691 0.26068747 0.27449950
## 9 medium no RF_class_50_d 0.24135067 0.23768385 0.24505587
## 10 high yes RF_class_50_d 0.34397163 0.29095949 0.40117754
## 11 low yes RF_class_50_d 0.27211670 0.26476793 0.27959189
## 12 medium yes RF_class_50_d 0.22078473 0.20917552 0.23284854
## 13 high no RF_class_75_d 0.05775491 0.05510883 0.06051990
## 14 low no RF_class_75_d 0.05766428 0.05413383 0.06141003
## 15 medium no RF_class_75_d 0.06164159 0.05960216 0.06374607
## 16 high yes RF_class_75_d 0.19503546 0.15301782 0.24525002
## 17 low yes RF_class_75_d 0.07553983 0.07125507 0.08006003
## 18 medium yes RF_class_75_d 0.06956522 0.06264829 0.07718295
## 19 high no pext_class 0.04017949 0.03814633 0.04231624
## 20 low no pext_class 0.04868144 0.04578085 0.05175584
## 21 medium no pext_class 0.03949843 0.03799601 0.04105773
## 22 high yes pext_class 0.21465969 0.17643388 0.25856722
## 23 low yes pext_class 0.05896585 0.05559707 0.06252523
## 24 medium yes pext_class 0.04189944 0.03701045 0.04740247
ggplot(cutoff_stats, aes(x=s_bin, y=PointEst, fill=s_bin)) +
geom_bar(col='black', stat='identity') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5, width=0.4) +
theme_bw() + guides(fill=F) + xlab('Conservation level') +
ylab('% predicted positive') +
facet_grid(vars(violator), vars(classifier))
ggplot(cutoff_stats, aes(x=s_bin, y=PointEst, fill=violator)) +
geom_bar(col='black', stat='identity', position='dodge') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5,
width=0.4, position=position_dodge(width=1)) +
theme_bw() + xlab('Conservation level') +
ylab('% predicted positive') +
facet_wrap(~classifier, nrow=1) +
theme(axis.text.x=element_text(angle=90, hjust=1)) +
scale_y_continuous(limits=c(0,1))
Some difference can be seen across strata; moreover, we clearly see that the proportion of predicted low-confidence LoFs is greater in violator genes with high conservation level. This is very nice! Let’s create one more “master plot” with all variant classes under consideration:
all_vars$pext_class_50 = all_vars$pext_avg < 0.5
all_vars$pext_class_25 = all_vars$pext_avg < 0.25
all_vars$pext_class_10 = all_vars$pext_avg < 0.1
snippet_gn = all_vars[, c('violator', 's_bin', 'RF_class_25_c',
'RF_class_50_c', 'RF_class_75_c',
'RF_class_25_d', 'RF_class_50_d',
'RF_class_75_d', 'pext_class_50',
'pext_class_25', 'pext_class_10')]
all_clv$model_clinvar = predict(my_forest_d, all_clv, type='vote')[, 2]
all_clv$pext_class_50 = all_clv$pext_avg < 0.5
all_clv$pext_class_25 = all_clv$pext_avg < 0.25
all_clv$pext_class_10 = all_clv$pext_avg < 0.1
all_clv$RF_class_25_d = all_clv$model_clinvar > 0.25
all_clv$RF_class_50_d = all_clv$model_clinvar > 0.5
all_clv$RF_class_75_d = all_clv$model_clinvar > 0.75
all_clv$RF_class_25_c = all_clv$model_b > 0.25
all_clv$RF_class_50_c = all_clv$model_b > 0.5
all_clv$RF_class_75_c = all_clv$model_b > 0.75
all_clv$violator = ifelse(all_clv$benign == 'yes',
'clv_benign', 'clv_patho')
snippet_clv = all_clv[, c('violator', 's_bin', 'RF_class_25_d',
'RF_class_50_d', 'RF_class_75_d',
'RF_class_25_c', 'RF_class_50_c',
'RF_class_75_c', 'pext_class_50',
'pext_class_25', 'pext_class_10')]
total = as.data.frame(rbind(snippet_gn, snippet_clv))
total_mlt = melt(total, id.vars=c('violator', 's_bin'))
total_mlt$model = ifelse(total_mlt$variable %in% c('RF_class_25_d',
'RF_class_50_d',
'RF_class_75_d'),
'ClinVar',
ifelse(total_mlt$variable %in% c('RF_class_25_c',
'RF_class_50_c',
'RF_class_75_c'),
'gnomAD', 'pext'))
total_mlt$cutoff = ifelse(total_mlt$variable %in% c('RF_class_25_d',
'RF_class_25_c',
'pext_class_50'),
'loose',
ifelse(total_mlt$variable %in% c('RF_class_50_d',
'RF_class_50_c',
'pext_class_25'),
'moderate', 'strict'))
colnames(total_mlt) = c('violator', 's_bin', 'classifier',
'class', 'model', 'cutoff')
ttl_stats = as.data.frame(as.matrix(aggregate(class~violator+s_bin+model+cutoff,
total_mlt,
function(x) binconf(sum(x), length(x) - sum(is.na(x))))))
colnames(ttl_stats) = c('violator', 's_bin', 'model', 'cutoff',
'PointEst', 'Lower', 'Upper')
ttl_stats$PointEst = as.numeric(as.character(ttl_stats$PointEst))
ttl_stats$Lower = as.numeric(as.character(ttl_stats$Lower))
ttl_stats$Upper = as.numeric(as.character(ttl_stats$Upper))
ttl_stats$s_bin = factor(ttl_stats$s_bin,
levels=c('low', 'medium', 'high'))
ggplot(ttl_stats, aes(x=s_bin, y=PointEst, fill=violator)) +
geom_bar(col='black', stat='identity', position='dodge') +
geom_errorbar(aes(ymin=Lower, ymax=Upper), lwd=0.5,
width=0.4, position=position_dodge(width=0.9)) +
theme_bw() + xlab('Conservation level') +
ylab('% predicted positive') +
facet_grid(vars(model), vars(cutoff)) +
theme(axis.text.x=element_text(angle=90, hjust=1)) +
scale_y_continuous(limits=c(0, 1))
We can also run the variant exclusion test, i.e., check whether the ClinVar-trained model helps to eliminate the non-uniformity of the PTV count distribution across populations.
vars_non_true = all_vars[all_vars$RF_class_25_d %in% c(FALSE, NA), ]
gene_level = aggregate(AC_main ~ gene, vars_non_true, sum)
gene_level$source_ac = sapply(as.character(gene_level$gene), function(x) cpl_head[x, 'AC_main'])
ggplot(gene_level, aes(source_ac, AC_main)) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() + scale_fill_gradientn(colours = mypal)
What happens to the imbalance of PTV counts?
cov_data = read.table('./covered_genes.tsv', header=T, sep='\t')
vars_non_true = vars_non_true[as.character(vars_non_true$gene) %in% as.character(cov_data$name), ]
vars_non_true$gene = droplevels(vars_non_true$gene)
ac_cols = grepl('AC_', colnames(vars_non_true))
an_cols = grepl('AN_', colnames(vars_non_true))
gene_data = data.frame(gene = sort(unique(vars_non_true$gene)))
for (i in colnames(vars_non_true)[ac_cols]){
gene_data[, i] = as.numeric(by(vars_non_true[, i], vars_non_true$gene, sum))
}
gene_data$AC_main = gene_data$AC_nfe + gene_data$AC_afr + gene_data$AC_amr +
gene_data$AC_eas + gene_data$AC_sas
for (i in colnames(vars_non_true)[an_cols]){
gene_data[, i] = as.numeric(by(vars_non_true[, i], vars_non_true$gene, mean))
}
gene_data$AN_main = gene_data$AN_nfe + gene_data$AN_afr + gene_data$AN_amr +
gene_data$AN_eas + gene_data$AN_sas
# Add covered genes with no pLoF
dummy_row = c(rep(0, 8), rep(colMeans(gene_data[, 10:17])))
zero_genes = unique(as.character(cov_data$name)[!(as.character(cov_data$name)
%in% as.character(gene_data$gene))])
remainder = as.data.frame(t(sapply(zero_genes, function(x) c(x, dummy_row))))
rownames(remainder) = c()
colnames(remainder) = colnames(gene_data)
flt_gene_data = as.data.frame(rbind(gene_data, remainder))
num_cols = colnames(flt_gene_data)[grepl('A[CN]_', colnames(flt_gene_data))]
for (col in num_cols){
flt_gene_data[, col] = as.numeric(as.character(flt_gene_data[, col]))
}
flt_gene_data$gene = as.factor(as.character(flt_gene_data$gene))
flt_gene_data =
flt_gene_data[flt_gene_data$AC_main/flt_gene_data$AN_main <= 0.001, ]
# Cassa comparison
cassa = read.table('cassa_table.csv', header=T, sep='\t')
rownames(cassa) = cassa$gene_symbol
flt_gene_data$U = sapply(flt_gene_data$gene,
function(x) cassa[as.character(x), 'U'])
flt_gene_data = flt_gene_data[!(is.na(flt_gene_data$U)), ]
flt_gene_data[is.na(flt_gene_data)] = 1
perc = ecdf(flt_gene_data$U)
flt_gene_data$tercile = floor(perc(flt_gene_data$U) * 3)
flt_gene_data$tercile = ifelse(flt_gene_data$tercile < 3,
flt_gene_data$tercile + 1,
flt_gene_data$tercile)
table(flt_gene_data$tercile)
##
## 1 2 3
## 5323 5328 5341
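The tercile assignment is worth unpacking: perc(U) lies in (0, 1], so floor(perc(U) * 3) lands in 0..3, with 3 reached only at the maximum; the ifelse then shifts 0..2 up by one while capping at 3, yielding bins 1..3. A self-contained check on made-up values:

```r
u <- c(0.1, 0.5, 0.9, 1.3, 2.0, 4.0)   # toy U values
perc <- ecdf(u)
tercile <- floor(perc(u) * 3)
tercile <- ifelse(tercile < 3, tercile + 1, tercile)
tercile  # smallest values land in bin 1, largest in bin 3
```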
flt_gene_data$shet_cassa = sapply(flt_gene_data$gene,
function(x) cassa[as.character(x), 's_het'])
table(is.na(flt_gene_data$shet_cassa))
##
## FALSE
## 15992
rownames(flt_gene_data) = flt_gene_data$gene
inheritance = read.table('genes_inheritance.txt', header=F, sep='\t',
stringsAsFactors = F, row.names=1)
flt_gene_data$disease = sapply(rownames(flt_gene_data),
function(x) ifelse(x%in% rownames(inheritance),
inheritance[x, 'V2'], 'none'))
flt_gene_data$dbin = sapply(flt_gene_data$disease,
function(x) ifelse(x == "none", 0, 1))
flt_gene_data$whole = apply(flt_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:8]),
as.numeric(elt[10:16])),
byrow=T, nrow=2))$p.value)
flt_gene_data$no_fin_asj = apply(flt_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6]),
as.numeric(elt[10:14])),
byrow=T, nrow=2))$p.value)
for (i in 2:6) {
flt_gene_data[, ncol(flt_gene_data) + 1] = apply(flt_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6][c(2:6) != i]),
as.numeric(elt[10:14][c(10:14) != 8+i])),
byrow=T, nrow=2))$p.value)
}
colnames(flt_gene_data) = c(colnames(flt_gene_data)[1:24], 'no_afr',
'no_sas', 'no_amr', 'no_eas', 'no_nfe')
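All of the whole, no_fin_asj and leave-one-population-out columns come from the same construction: a 2 x k chi-squared test of per-population allele counts (AC) against the corresponding mean allele numbers (AN), i.e., a test of whether AC is proportional to AN across populations. A toy version with made-up counts:

```r
an <- c(10000, 25000, 8000)   # mean allele numbers in three populations
ac_prop <- c(12, 30, 9)       # counts roughly proportional to AN
ac_skew <- c(40, 30, 9)       # excess of carriers in population 1

p_prop <- chisq.test(matrix(c(ac_prop, an), byrow = TRUE, nrow = 2))$p.value
p_skew <- chisq.test(matrix(c(ac_skew, an), byrow = TRUE, nrow = 2))$p.value
c(proportional = p_prop, skewed = p_skew)  # only the skewed case is significant
```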
flt_gene_data$s_main = sapply(as.character(flt_gene_data$gene),
function(x) as.numeric(cpl_head[as.character(cpl_head$gene) == x,
's_main'])[1])
#flt_gene_data = flt_gene_data[flt_gene_data$s_main >= 0.02, ]
tpl = melt(flt_gene_data[, c(1, 21:29)], id.vars=c('gene', 'disease', 'dbin'))
tpl = na.omit(tpl[tpl$value < 1e-5, ])
sum(tpl$variable == 'whole')
## [1] 1692
sum(tpl$variable == 'whole' & tpl$dbin == 1)
## [1] 325
sum(tpl$variable == 'no_fin_asj')
## [1] 1222
sum(tpl$variable == 'no_fin_asj' & tpl$dbin == 1)
## [1] 208
ggplot(tpl, aes(x=variable, fill=as.factor(dbin))) + geom_bar() +
theme_bw() + coord_flip()
flt_gene_data$no_fin_asj = sapply(flt_gene_data$no_fin_asj,
function(x) ifelse(is.na(x), 1, x))
flt_gene_data$no_fin_asj_prior = sapply(as.character(flt_gene_data$gene),
function(x) as.numeric(cpl_head[as.character(cpl_head$gene) == x, 'no_fin_asj'])[1])
ggplot(flt_gene_data, aes(x=-log10(no_fin_asj_prior), y=-log10(no_fin_asj))) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() +
xlab('chi-squared p-value pre-filter') +
ylab('chi-squared p-value post-filter') +
scale_fill_gradientn(colours = mypal)
table(flt_gene_data$no_fin_asj > 1e-5 & flt_gene_data$no_fin_asj_prior < 1e-5)
##
## FALSE TRUE
## 12782 794
binconf(sum(flt_gene_data$no_fin_asj >= 1e-5 &
flt_gene_data$no_fin_asj_prior < 1e-5, na.rm=T),
sum(flt_gene_data$no_fin_asj_prior < 1e-5, na.rm=T))
## PointEst Lower Upper
## 0.4266523 0.4043551 0.4492517
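The gene-level aggregation above (AC columns summed per gene, AN columns averaged) is about to be repeated verbatim for two more variant sets; it could be factored into a helper. A sketch, assuming the input table carries the same AC_*/AN_* columns as all_vars (the function name aggregate_gene_level is ours):

```r
# Hypothetical refactoring: collapse a variant table to gene level.
# AC columns are summed per gene, AN columns are averaged, matching
# the repeated blocks in this notebook.
aggregate_gene_level <- function(vars, cov_names) {
  vars <- vars[as.character(vars$gene) %in% cov_names, ]
  vars$gene <- droplevels(factor(vars$gene))
  out <- data.frame(gene = sort(unique(vars$gene)))
  for (i in grep('^AC_', colnames(vars), value = TRUE))
    out[, i] <- as.numeric(by(vars[, i], vars$gene, sum))
  for (i in grep('^AN_', colnames(vars), value = TRUE))
    out[, i] <- as.numeric(by(vars[, i], vars$gene, mean))
  out
}
```

The downstream steps (adding zero-pLoF genes, the Cassa join, and the chi-squared tests) could be wrapped in the same way.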
Compare the percentage of genes with no imbalance after filtering with the RF model against random exclusion of the same number of variants:
vars_random = all_vars[sample(1:nrow(all_vars), nrow(vars_non_true)), ]
cov_data = read.table('./covered_genes.tsv', header=T, sep='\t')
vars_random = vars_random[as.character(vars_random$gene) %in% as.character(cov_data$name), ]
vars_random$gene = droplevels(vars_random$gene)
ac_cols = grepl('AC_', colnames(vars_random))
an_cols = grepl('AN_', colnames(vars_random))
gene_data = data.frame(gene = sort(unique(vars_random$gene)))
for (i in colnames(vars_random)[ac_cols]){
gene_data[, i] = as.numeric(by(vars_random[, i], vars_random$gene, sum))
}
gene_data$AC_main = gene_data$AC_nfe + gene_data$AC_afr + gene_data$AC_amr +
gene_data$AC_eas + gene_data$AC_sas
for (i in colnames(vars_random)[an_cols]){
gene_data[, i] = as.numeric(by(vars_random[, i], vars_random$gene, mean))
}
gene_data$AN_main = gene_data$AN_nfe + gene_data$AN_afr + gene_data$AN_amr +
gene_data$AN_eas + gene_data$AN_sas
# Add covered genes with no pLoF
dummy_row = c(rep(0, 8), rep(colMeans(gene_data[, 10:17])))
zero_genes = unique(as.character(cov_data$name)[!(as.character(cov_data$name)
%in% as.character(gene_data$gene))])
remainder = as.data.frame(t(sapply(zero_genes, function(x) c(x, dummy_row))))
rownames(remainder) = c()
colnames(remainder) = colnames(gene_data)
random_gene_data = as.data.frame(rbind(gene_data, remainder))
num_cols = colnames(random_gene_data)[grepl('A[CN]_', colnames(random_gene_data))]
for (col in num_cols){
random_gene_data[, col] = as.numeric(as.character(random_gene_data[, col]))
}
random_gene_data$gene = as.factor(as.character(random_gene_data$gene))
random_gene_data =
random_gene_data[random_gene_data$AC_main/random_gene_data$AN_main <= 0.001, ]
# Cassa comparison
cassa = read.table('cassa_table.csv', header=T, sep='\t')
rownames(cassa) = cassa$gene_symbol
random_gene_data$U = sapply(random_gene_data$gene,
function(x) cassa[as.character(x), 'U'])
random_gene_data = random_gene_data[!(is.na(random_gene_data$U)), ]
random_gene_data[is.na(random_gene_data)] = 1
perc = ecdf(random_gene_data$U)
random_gene_data$tercile = floor(perc(random_gene_data$U) * 3)
random_gene_data$tercile = ifelse(random_gene_data$tercile < 3,
random_gene_data$tercile + 1,
random_gene_data$tercile)
table(random_gene_data$tercile)
##
## 1 2 3
## 5323 5328 5341
random_gene_data$shet_cassa = sapply(random_gene_data$gene,
function(x) cassa[as.character(x), 's_het'])
table(is.na(random_gene_data$shet_cassa))
##
## FALSE
## 15992
rownames(random_gene_data) = random_gene_data$gene
inheritance = read.table('genes_inheritance.txt', header=F, sep='\t',
stringsAsFactors = F, row.names=1)
random_gene_data$disease = sapply(rownames(random_gene_data),
function(x) ifelse(x%in% rownames(inheritance),
inheritance[x, 'V2'], 'none'))
random_gene_data$dbin = sapply(random_gene_data$disease,
function(x) ifelse(x == "none", 0, 1))
random_gene_data$whole = apply(random_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:8]),
as.numeric(elt[10:16])),
byrow=T, nrow=2))$p.value)
random_gene_data$no_fin_asj = apply(random_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6]),
as.numeric(elt[10:14])),
byrow=T, nrow=2))$p.value)
for (i in 2:6) {
random_gene_data[, ncol(random_gene_data) + 1] = apply(random_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6][c(2:6) != i]),
as.numeric(elt[10:14][c(10:14) != 8+i])),
byrow=T, nrow=2))$p.value)
}
colnames(random_gene_data) = c(colnames(random_gene_data)[1:24], 'no_afr',
'no_sas', 'no_amr', 'no_eas', 'no_nfe')
random_gene_data$s_main = sapply(as.character(random_gene_data$gene),
function(x) as.numeric(cpl_head[as.character(cpl_head$gene) == x,
's_main'])[1])
#random_gene_data = random_gene_data[random_gene_data$s_main >= 0.02, ]
tpl = melt(random_gene_data[, c(1, 21:29)], id.vars=c('gene', 'disease', 'dbin'))
tpl = na.omit(tpl[tpl$value < 1e-5, ])
sum(tpl$variable == 'whole')
## [1] 1882
sum(tpl$variable == 'whole' & tpl$dbin == 1)
## [1] 345
sum(tpl$variable == 'no_fin_asj')
## [1] 1378
sum(tpl$variable == 'no_fin_asj' & tpl$dbin == 1)
## [1] 221
ggplot(tpl, aes(x=variable, fill=as.factor(dbin))) + geom_bar() +
theme_bw() + coord_flip()
random_gene_data$no_fin_asj = sapply(random_gene_data$no_fin_asj,
function(x) ifelse(is.na(x), 1, x))
random_gene_data$no_fin_asj_prior = sapply(as.character(random_gene_data$gene),
function(x) as.numeric(cpl_head[as.character(cpl_head$gene) == x, 'no_fin_asj'])[1])
ggplot(random_gene_data, aes(x=-log10(no_fin_asj_prior), y=-log10(no_fin_asj))) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() +
xlab('chi-squared p-value pre-filter') +
ylab('chi-squared p-value post-filter') +
scale_fill_gradientn(colours = mypal)
table(random_gene_data$no_fin_asj > 1e-5 & random_gene_data$no_fin_asj_prior < 1e-5)
##
## FALSE TRUE
## 12855 721
binconf(sum(random_gene_data$no_fin_asj >= 1e-5 &
random_gene_data$no_fin_asj_prior < 1e-5, na.rm=T),
sum(random_gene_data$no_fin_asj_prior < 1e-5, na.rm=T))
## PointEst Lower Upper
## 0.3876344 0.3657461 0.4099859
And the same with a pext-based filter matched in the number of removed variants:
vars_pext_flt = all_vars[!(all_vars$pext_avg < 0.7), ]
cov_data = read.table('./covered_genes.tsv', header=T, sep='\t')
vars_pext_flt = vars_pext_flt[as.character(vars_pext_flt$gene) %in% as.character(cov_data$name), ]
vars_pext_flt$gene = droplevels(vars_pext_flt$gene)
ac_cols = grepl('AC_', colnames(vars_pext_flt))
an_cols = grepl('AN_', colnames(vars_pext_flt))
gene_data = data.frame(gene = sort(unique(vars_pext_flt$gene)))
for (i in colnames(vars_pext_flt)[ac_cols]){
gene_data[, i] = as.numeric(by(vars_pext_flt[, i], vars_pext_flt$gene, sum))
}
gene_data$AC_main = gene_data$AC_nfe + gene_data$AC_afr + gene_data$AC_amr +
gene_data$AC_eas + gene_data$AC_sas
for (i in colnames(vars_pext_flt)[an_cols]){
gene_data[, i] = as.numeric(by(vars_pext_flt[, i], vars_pext_flt$gene, mean))
}
gene_data$AN_main = gene_data$AN_nfe + gene_data$AN_afr + gene_data$AN_amr +
gene_data$AN_eas + gene_data$AN_sas
# Add covered genes with no pLoF
dummy_row = c(rep(0, 8), rep(colMeans(gene_data[, 10:17])))
zero_genes = unique(as.character(cov_data$name)[!(as.character(cov_data$name)
%in% as.character(gene_data$gene))])
remainder = as.data.frame(t(sapply(zero_genes, function(x) c(x, dummy_row))))
rownames(remainder) = c()
colnames(remainder) = colnames(gene_data)
pext_gene_data = as.data.frame(rbind(gene_data, remainder))
num_cols = colnames(pext_gene_data)[grepl('A[CN]_', colnames(pext_gene_data))]
for (col in num_cols){
pext_gene_data[, col] = as.numeric(as.character(pext_gene_data[, col]))
}
pext_gene_data$gene = as.factor(as.character(pext_gene_data$gene))
pext_gene_data =
pext_gene_data[pext_gene_data$AC_main/pext_gene_data$AN_main <= 0.001, ]
# Cassa comparison
cassa = read.table('cassa_table.csv', header=T, sep='\t')
rownames(cassa) = cassa$gene_symbol
pext_gene_data$U = sapply(pext_gene_data$gene,
function(x) cassa[as.character(x), 'U'])
pext_gene_data = pext_gene_data[!(is.na(pext_gene_data$U)), ]
pext_gene_data[is.na(pext_gene_data)] = 1
perc = ecdf(pext_gene_data$U)
pext_gene_data$tercile = floor(perc(pext_gene_data$U) * 3)
pext_gene_data$tercile = ifelse(pext_gene_data$tercile < 3,
pext_gene_data$tercile + 1,
pext_gene_data$tercile)
table(pext_gene_data$tercile)
##
## 1 2 3
## 5323 5329 5342
pext_gene_data$shet_cassa = sapply(pext_gene_data$gene,
function(x) cassa[as.character(x), 's_het'])
table(is.na(pext_gene_data$shet_cassa))
##
## FALSE
## 15994
rownames(pext_gene_data) = pext_gene_data$gene
inheritance = read.table('genes_inheritance.txt', header=F, sep='\t',
stringsAsFactors = F, row.names=1)
pext_gene_data$disease = sapply(rownames(pext_gene_data),
function(x) ifelse(x%in% rownames(inheritance),
inheritance[x, 'V2'], 'none'))
pext_gene_data$dbin = sapply(pext_gene_data$disease,
function(x) ifelse(x == "none", 0, 1))
pext_gene_data$whole = apply(pext_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:8]),
as.numeric(elt[10:16])),
byrow=T, nrow=2))$p.value)
pext_gene_data$no_fin_asj = apply(pext_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6]),
as.numeric(elt[10:14])),
byrow=T, nrow=2))$p.value)
for (i in 2:6) {
pext_gene_data[, ncol(pext_gene_data) + 1] = apply(pext_gene_data,
1,
function(elt)
chisq.test(matrix(c(as.numeric(elt[2:6][c(2:6) != i]),
as.numeric(elt[10:14][c(10:14) != 8+i])),
byrow=T, nrow=2))$p.value)
}
colnames(pext_gene_data) = c(colnames(pext_gene_data)[1:24], 'no_afr',
'no_sas', 'no_amr', 'no_eas', 'no_nfe')
pext_gene_data$s_main = sapply(as.character(pext_gene_data$gene),
function(x) as.numeric(cpl_head[as.character(cpl_head$gene) == x,
's_main'])[1])
#pext_gene_data = pext_gene_data[pext_gene_data$s_main >= 0.02, ]
tpl = melt(pext_gene_data[, c(1, 21:29)], id.vars=c('gene', 'disease', 'dbin'))
tpl = na.omit(tpl[tpl$value < 1e-5, ])
sum(tpl$variable == 'whole')
## [1] 1592
sum(tpl$variable == 'whole' & tpl$dbin == 1)
## [1] 290
sum(tpl$variable == 'no_fin_asj')
## [1] 1154
sum(tpl$variable == 'no_fin_asj' & tpl$dbin == 1)
## [1] 179
ggplot(tpl, aes(x=variable, fill=as.factor(dbin))) + geom_bar() +
theme_bw() + coord_flip()
pext_gene_data$no_fin_asj = sapply(pext_gene_data$no_fin_asj,
function(x) ifelse(is.na(x), 1, x))
pext_gene_data$no_fin_asj_prior = sapply(as.character(pext_gene_data$gene),
function(x) as.numeric(cpl_head[as.character(cpl_head$gene) == x, 'no_fin_asj'])[1])
ggplot(pext_gene_data, aes(x=-log10(no_fin_asj_prior), y=-log10(no_fin_asj))) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() +
xlab('chi-squared p-value pre-filter') +
ylab('chi-squared p-value post-filter') +
scale_fill_gradientn(colours = mypal)
table(pext_gene_data$no_fin_asj > 1e-4 & pext_gene_data$no_fin_asj_prior < 1e-5)
##
## FALSE TRUE
## 12847 731
binconf(sum(pext_gene_data$no_fin_asj >= 1e-5 &
pext_gene_data$no_fin_asj_prior < 1e-5, na.rm=T),
sum(pext_gene_data$no_fin_asj_prior < 1e-5, na.rm=T))
## PointEst Lower Upper
## 0.4019344 0.379883 0.4243899
Yet another test: examine the AC distribution of variants marked as lcpLoF by our classifiers!
One more hypothesis is that variants with a higher prediction probability (i.e., a higher chance of being low-confidence pLoF) should also have a higher AC, as they tend to drive the imbalance of PTV counts across populations. Let’s examine this hypothesis:
ggplot(all_vars, aes(x=RF_class_25_d, y=AC_main, fill=RF_class_25_d)) +
geom_violin(alpha=0.5, lwd=0.5) +
geom_boxplot(width=0.4, lwd=0.5, outlier.shape=NA, fill='white') +
theme_bw() + scale_y_continuous(limits=c(0, 80))
ggplot(all_vars, aes(x=prediction_d, y=AC_main)) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() + scale_fill_gradientn(colours = mypal)
cor.test(all_vars$prediction_d, all_vars$AC_main)
##
## Pearson's product-moment correlation
##
## data: all_vars$prediction_d and all_vars$AC_main
## t = 7.668, df = 114959, p-value = 1.761e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.01683153 0.02838684
## sample estimates:
## cor
## 0.02260994
ggplot(all_vars, aes(x=pred_all_data, y=AC_main)) +
geom_hex(aes(fill=log10(..count..))) +
theme_bw() + scale_fill_gradientn(colours = mypal)
cor.test(all_vars$pred_all_data, all_vars$AC_main)
##
## Pearson's product-moment correlation
##
## data: all_vars$pred_all_data and all_vars$AC_main
## t = 17.125, df = 114959, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.04467476 0.05620656
## sample estimates:
## cor
## 0.05044234
It’s very cool to see that there is a very small but significant positive correlation between AC and prediction probability for both the gnomAD-trained and the ClinVar-trained models; moreover, consistent with the results of the “exclusion test”, the coefficient is higher for the gnomAD-based model.